
Conversation

@tazarov
Contributor

@tazarov tazarov commented Dec 13, 2024

Description of changes

Closes #3296

The delete collection logic changes slightly to accommodate the fix without breaking the transactional integrity of self._sysdb.delete_collection. chromadb.segment.SegmentManager.delete_segments had to change to accept the list of segments to delete instead of a collection_id.

image
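A rough sketch of the interface change described above (the element type and exact signature are my assumptions, not the merged code):

from abc import ABC, abstractmethod
from typing import Sequence

from chromadb.types import Segment


class SegmentManager(ABC):
    # Previously: delete_segments(self, collection_id: UUID) -> Sequence[UUID]
    # Now the caller fetches the collection's segments before the atomic
    # self._sysdb.delete_collection call and passes them in for local cleanup.
    @abstractmethod
    def delete_segments(self, segments: Sequence[Segment]) -> None:
        ...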

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • Fixes the resource leak when deleting a collection

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Documentation Changes

N/A

@github-actions

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of unexpectedly high quality? (Readability, Modularity, Intuitiveness)

@tazarov tazarov added the bug (Something isn't working) and Local Chroma (An improvement to Local (single node) Chroma) labels Dec 16, 2024
@tazarov tazarov force-pushed the trayan-12-13-fix_delete_collection_resource_leak branch 2 times, most recently from 2e113a0 to b53dadb Compare January 3, 2025 07:48
Contributor

@rohitcpbot rohitcpbot left a comment

Thanks for identifying the leak and raising the fix. I did not see this earlier, so I did not review it sooner; my miss. Reviewed it now.

)

if existing:
    segments = self._sysdb.get_segments(collection=existing[0].id)
Contributor

I feel we should try not to call the sysdb for getting segments. It adds an extra call to the backend for distributed Chroma.

Looking at the current code, I see we are already calling sysdb.get_segments() from the manager, so you are simply moving that line here and not adding extra calls. But I feel we can do better.

Do you think we should just call delete_segment() from delete_collection()? Then we could add this snippet back:

   for s in self._manager.delete_segments(existing[0]["id"]):
       self._sysdb.delete_segment(s)

and do a no-op inside delete_segments() in db/impl/grpc/client.py.
Will that fix the leak?

Contributor Author

@rohitcpbot,

using this snippet:

for s in self._manager.delete_segments(existing[0]["id"]):
    self._sysdb.delete_segment(s)

Makes sense; however, we'd revert to a non-atomic deletion of sysdb resources. In the above snippet we delete the segments separately from deleting the collection, which I deliberately wanted to avoid. That is why I pulled the fetch of the segments up here, before they are atomically deleted as part of self._sysdb.delete_collection.
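In other words, at this point the PR does roughly the following (a sketch pieced together from the diff fragments in this thread, not the exact code):

if existing:
    # Fetch the segments while the collection still exists in the sysdb...
    segments = self._sysdb.get_segments(collection=existing[0].id)
    # ...delete the collection and its segment rows atomically...
    self._sysdb.delete_collection(existing[0].id, tenant=tenant, database=database)
    # ...then clean up the locally stored segment data (HNSW directory, etc.).
    self._manager.delete_segments(segments=segments)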

Why do you think that this would cause extra calls in the distributed backend?

@tazarov tazarov force-pushed the trayan-12-13-fix_delete_collection_resource_leak branch from b53dadb to ba07228 Compare January 8, 2025 08:21
 def delete_segments(self, collection_id: UUID) -> Sequence[UUID]:
-    segments = self._sysdb.get_segments(collection=collection_id)
-    return [s["id"] for s in segments]
+    return []  # noop
Contributor Author

@HammadB, I talked with @rohitcpbot and he mentioned that this should be a no-op. Is this fine, or should I revert to the older version with the distributed sysdb query?

)

if existing:
    self._manager.delete_segments(collection_id=existing[0].id)
Contributor Author

@rohitcpbot, this is the actual change as we discussed; the rest is just black formatting changes.

Contributor

Thanks @tazarov.
If possible, leave a note with the following comment or similar:
"""
This call will delete segment-related data that is stored locally and cannot be part of the atomic SQL transaction.
It is a NoOp for the distributed sysdb implementation.
Omitting this call will lead to a leak of segment-related resources.
"""

Can you answer something for me: if the process crashes immediately after self._manager.delete_segments(collection_id=existing[0].id), then the actual entries in SQL are not deleted, which means the collection is not deleted.
Now, if the user issues a Get or Query, will the local manager work correctly?

Contributor

If the fix to make the local manager work in the above failure scenario is non-trivial, then we could leave a note here and take it up as a separate task. But it would be good to know the state of the DB with the above change.

The same scenario would have had to be thought through even with your earlier changes of doing the local manager delete after the sysdb delete, where the SQL could have gone through but the local manager delete did not because of a crash, leading to a leak.

Contributor Author

@tazarov tazarov Jan 9, 2025

@rohitcpbot, the local manager has two segments for each collection:

  • sqlite - this will actually delete the segment from segments -
    def delete(self) -> None:
  • hnsw - it will delete the directory where the HNSW index is stored, but will not delete the segment from segments (see the sketch below)
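To make the difference between the two paths concrete, here is a rough sketch (table and directory names are made up for illustration, not the actual Chroma schema):

import shutil
import sqlite3
from pathlib import Path
from uuid import UUID


def delete_metadata_segment(conn: sqlite3.Connection, segment_id: UUID) -> None:
    # The sqlite-backed segment removes its rows inside the database itself,
    # so this step can share a transaction with the sysdb deletes.
    conn.execute(
        "DELETE FROM embedding_metadata WHERE segment_id = ?", (str(segment_id),)
    )


def delete_hnsw_segment(persist_root: Path, segment_id: UUID) -> None:
    # The HNSW segment removes its on-disk directory. This is a filesystem
    # operation that no SQL transaction can roll back, and it does not touch
    # the segment's row in the segments table.
    shutil.rmtree(persist_root / str(segment_id), ignore_errors=True)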

So here is a diagram to explain the point of failure:

image

The main problem as I see it in the current impl (with possible solutions):

image

Contributor Author

I think the only foolproof way to remove it all is to wrap everything in a single transaction, all the way from the segment level. Then, if the physical dir removal fails, we roll back the whole sqlite transaction.

As a side note, on Windows deleting the segment dir right after closing file handles frequently fails.

@tazarov tazarov requested a review from rohitcpbot January 8, 2025 16:34
@rohitcpbot
Contributor

Thanks @tazarov, I left a comment asking to add a code comment, and also a question. We should be good to merge after that.

@tazarov tazarov force-pushed the trayan-12-13-fix_delete_collection_resource_leak branch 2 times, most recently from 9fd2b3c to 788a07f Compare January 9, 2025 19:00
@tazarov
Contributor Author

tazarov commented Jan 9, 2025

Thanks @tazarov, I left a comment asking to add a code comment, and also a question. We should be good to merge after that.

After some deliberation, I think a good course of action is to make it all atomic by having sysdb.delete_collection take a callback which removes the physical resources for single-node Chroma and is a no-op for distributed. Here's how this could look:

self._sysdb.delete_collection(
    existing[0].id,
    # the callback is the new, proposed argument
    lambda collection_id: self._manager.delete_segments(collection_id=collection_id),
    tenant=tenant,
    database=database,
)

and inside delete_collection we'll invoke the callback at the end of the transaction (or at the beginning), but the gist is that any failure to remove the segments will cause the deletion of the collection and its segments to be rolled back.
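A minimal sketch of what the sysdb side of that could look like, assuming a plain sqlite3 connection and illustrative table names (the callback parameter itself is hypothetical):

import sqlite3
from typing import Callable, Optional
from uuid import UUID


def delete_collection(
    conn: sqlite3.Connection,
    collection_id: UUID,
    delete_segments: Optional[Callable[[UUID], None]] = None,  # hypothetical parameter
) -> None:
    # One transaction: if the callback raises (e.g. the HNSW directory cannot
    # be removed), the SQL deletes below are rolled back as well.
    with conn:
        conn.execute("DELETE FROM segments WHERE collection = ?", (str(collection_id),))
        conn.execute("DELETE FROM collections WHERE id = ?", (str(collection_id),))
        if delete_segments is not None:
            # Single-node passes the local cleanup; distributed passes nothing (no-op).
            delete_segments(collection_id)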

Wdyt?

@tazarov
Contributor Author

tazarov commented Jan 24, 2025

@rohitcpbot,

As discussed offline, here are two promising approaches:

callback

self._sysdb.delete_collection(
    existing[0].id,
    lambda collection_id: self._manager.delete_segments(collection_id=collection_id),
    tenant=tenant,
    database=database,
)

Possible failure points:

  • HNSW delete - we can prevent a TX rollback with try/except (see the sketch after the diagram below)
  • Commit failure - this has a low chance, but if the commit fails then the HNSW is already deleted

image
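Regarding the first failure point above (preventing a TX rollback when the HNSW delete fails), a minimal sketch continuing the hypothetical delete_collection from my earlier comment:

with conn:  # same single transaction as in the earlier sketch
    conn.execute("DELETE FROM segments WHERE collection = ?", (str(collection_id),))
    conn.execute("DELETE FROM collections WHERE id = ?", (str(collection_id),))
    try:
        delete_segments(collection_id)  # local HNSW/file cleanup
    except OSError as e:
        # Swallow the error so a failed local cleanup does not roll back the
        # SQL deletes; the leftover files become a known, logged leak instead.
        logger.warning("Failed to remove local segment data: %s", e)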

return segments

segments = self._sysdb.delete_collection(
    existing[0].id, tenant=tenant, database=database
)
self._manager.delete_segments(segments=segments)

Possible failure points:

  • sysdb delete - all collection data is left as-is
  • HNSW delete - sysdb data is gone, all collection "physical data" (hnsw/docs/metadata/fts) is left as-is (this assumes the HNSW segment is deleted first, although we don't have any segment ordering guarantees, so it can be last)
  • Metadata delete - sysdb and HNSW are gone, some collection physical data (docs/metadata/fts) is left as-is

Tip

In the above, read "left as-is" as "leaked resources".

image

Let me know once you get a chance to discuss it in the office.

@rohitcpbot
Contributor

@tazarov - as discussed offline, let's just fix the leaks for now, and then we'll see how the atomicity will work across the various db entries and files. We can open a new ticket for fixing the atomicity and assign it to you or me. Thanks!

@tazarov tazarov force-pushed the trayan-12-13-fix_delete_collection_resource_leak branch from 788a07f to 6163095 Compare February 5, 2025 16:54
@tazarov
Contributor Author

tazarov commented Feb 5, 2025

@rohitcpbot, I think this is it: the simplest form of the fix, where we just swap the order and delete the segments first, then the sysdb entries.
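Concretely, the delete path ends up in roughly this order (a sketch assembled from the diff fragments above, not the exact merged code):

if existing:
    # 1. Remove locally stored segment data first (e.g. the HNSW directory).
    #    This cannot be part of the SQL transaction below; skipping it was the leak.
    self._manager.delete_segments(collection_id=existing[0].id)
    # 2. Then delete the collection and its segment rows atomically in the sysdb.
    self._sysdb.delete_collection(existing[0].id, tenant=tenant, database=database)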

@tazarov tazarov force-pushed the trayan-12-13-fix_delete_collection_resource_leak branch from c1d60e3 to 6fdf23d Compare February 5, 2025 19:12
Contributor

@rohitcpbot rohitcpbot left a comment

LGTM

@tazarov tazarov merged commit 64b1a3d into main Feb 11, 2025
103 checks passed
Inventrohyder pushed a commit to Inventrohyder/chroma that referenced this pull request Aug 5, 2025