
Unexpected memory retention when reading slices of dataframes #2348

@rmlynx

Description

Describe the bug

When reading and slicing a subset of a large DataFrame:

  1. The entire DataFrame appears to be loaded into memory.
  2. A slice is taken and returned, likely as a view retaining a reference to the original.
  3. If this operation is repeated in a loop and each slice is stored (e.g., in a list), the original large DataFrames are never deallocated.

This causes cumulative memory usage to increase continuously, eventually leading to an out-of-memory crash.
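The suspected mechanism matches how views behave in plain numpy/pandas; a minimal sketch, using numpy only (not ArcticDB), purely to illustrate the hypothesis:

import numpy as np

big = np.random.randint(0, 100, size=(5_000, 10_000))  # ~400 MB of int64
small = big[1:4]                                        # 3-row view, not a copy

print(f"{small.nbytes / 1024**2:.2f} MB reported for the view")  # ~0.23 MB
print(small.base is big)  # True: the view keeps the full parent buffer alive
del big                   # the ~400 MB buffer is NOT freed while `small` exists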

Steps/Code to Reproduce

1. Create demo data

import pandas as pd
import numpy as np


def generate_random_dataframe(n_rows=25, n_cols=10, start_date="2000-01-01", freq="D"):
    """
    Generate a random DataFrame with a datetime index.

    Args:
        n_rows (int): Number of rows in the DataFrame.
        n_cols (int): Number of columns in the DataFrame.
        start_date (str): Start date for the datetime index.
        freq (str): Frequency for the datetime index (e.g., 'D' for daily, 'H' for hourly).

    Returns:
        pd.DataFrame: A random DataFrame with datetime as the index.
    """
    # Generate column names
    cols = [f"COL_{i}" for i in range(n_cols)]

    # Generate random data
    data = np.random.randint(0, 100, size=(n_rows, n_cols))

    # Create a datetime index
    index = pd.date_range(start=start_date, periods=n_rows, freq=freq)

    # Create the DataFrame
    df = pd.DataFrame(data, columns=cols, index=index)

    return df

Write a DataFrame of 20 years of daily data with 10,000 columns of random values to ArcticDB:

df = generate_random_dataframe(
    n_rows=255 * 20, n_cols=10_000, start_date="1990-01-01", freq="D"
)
df.head()
import arcticdb as adb

uri = "lmdb://tmp/arcticdb_leak"
ac = adb.Arctic(uri)

library = ac.get_library("demo_lib", create_if_missing=True)

library.write("test_frame", df)
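Optionally, a quick sanity check that the symbol was written (not part of the original report):

print(library.list_symbols())  # expect 'test_frame' to be listed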

2. Read Data

Helper function to get memory usage of a list of DataFrames:

def get_total_dataframe_size_gb(df_list):
    """
    Calculate total memory usage of a list of DataFrames in gigabytes.

    Parameters:
        df_list (list of pd.DataFrame): List of DataFrames.

    Returns:
        float: Total size in GB.
    """
    total_bytes = sum(df.memory_usage(deep=True).sum() for df in df_list)
    return total_bytes / (1024**3)

Read the full dataframe for reference:

from_storage_df = library.read("test_frame").data

print(f"Shape of data: {from_storage_df.shape}")
print(f"Size of data: {get_total_dataframe_size_gb([from_storage_df]):.2f} GB")

Read only a slice:

n_rows_to_read = 3

from_date = from_storage_df.index[1]
to_date = from_storage_df.index[n_rows_to_read]

small_df = library.read(
    "test_frame",
    date_range=(from_date, to_date),
).data

print(f"Shape of fetched subset of data: {small_df.shape}")
print(
    f"Size of fetched subset of data: {get_total_dataframe_size_gb([small_df]):.4f} GB"
)

Now read the small slice in a loop and save results in a list:

retrieved_data = []
n_times_to_fetch = 100

for i in range(n_times_to_fetch):
    small_df = library.read(
        "test_frame",
        date_range=(
            from_date,
            to_date,
        ),
    ).data
    retrieved_data.append(small_df)

    if i % 10 == 0:
        print(f"Fetched small subset {i} times")
        print(
            f"    Total size of retrieved data so far: {get_total_dataframe_size_gb(retrieved_data):.2f} GB"
        )

print()
print(
    f"Total size of retrieved data: {get_total_dataframe_size_gb(retrieved_data):.2f} GB"
)

Expected Results

Memory usage should increase in total by roughly 0.0002 GB * 100 = 0.02 GB. Instead, it increases by several hundred MB per iteration, eventually running out of memory.
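For reference, the arithmetic behind that estimate (each slice is 3 rows of 10,000 int64 columns):

per_slice_bytes = 3 * 10_000 * 8           # rows × columns × 8 bytes per int64
per_slice_gb = per_slice_bytes / 1024**3   # ≈ 0.0002 GB
print(f"{per_slice_gb:.5f} GB per slice, ~{per_slice_gb * 100:.3f} GB for 100 reads")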


OS, Python Version and ArcticDB Version

  • Linux
  • Python 3.10.12
  • ArcticDB 5.2.3

Backend storage used

LMDB

Additional Context

We're able to bypass the problem by passing adb.QueryBuilder().date_range((from_date, to_date)) to library.read instead of the date_range argument, but it's not clear whether that is the intended way to do it, or why the most obvious way to read a slice of a DataFrame causes this memory leak.
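For completeness, a sketch of that workaround as described above, assuming the query_builder keyword of library.read (the report does not show the exact call):

# Express the date filter via QueryBuilder instead of the date_range argument.
q = adb.QueryBuilder().date_range((from_date, to_date))

small_df = library.read("test_frame", query_builder=q).data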
