Skip to content

Use TextIOWrapper for opening file from repository in text mode #6847

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

ahkole
Copy link

@ahkole ahkole commented Apr 30, 2025

In the current implementation, the contents of a file are wrapped in a StringIO if a user tries to open a file from the repository in text mode (i.e. with folderdata.open('somefile.txt', 'r')). This means the entire file is loaded into memory, which can cause a significant increase in memory usage for large text files. This PR proposed to instead wrap the binary stream in a TextIOWrapper for streaming based access in text mode to avoid having to load the entire file into memory.

A small change was also necessary for the cif-file interface, since apparently the CifParser from pymatgen only accepts filenames or StringIO streams.

This PR depends on a PR for the disk-objectstore dependency (see aiidateam/disk-objectstore#192) that adds some attributes to custom streaming types that are required by TextIOWrapper.

@ahkole
Copy link
Author

ahkole commented Apr 30, 2025

See also https://aiida.discourse.group/t/memory-usage-folderdata-open/524/22 for a discussion on this issue.

@khsrali khsrali self-assigned this Apr 30, 2025
Copy link
Contributor

@khsrali khsrali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @ahkole

Looks good to me already.
Just to make sure if it works as expected, could you please to add your tests into the repository? --maybe requires some modification, but then we'll know for sure if something's break

@ahkole
Copy link
Author

ahkole commented May 1, 2025

I've added a small test that for the reading in text mode. I think that should cover the changes, right? The small change to the cif-interface is already covered by an existing test (that is also how I discovered that that needed to be changed).

Note btw that the changes in the disk-objectstore PR are needed for the tests to pass. I guess a bump of the version number is required in pyproject.toml once a new version of disk-objectstore with these changes has been released?

Copy link
Contributor

@khsrali khsrali left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @ahkole,

To link the correct dependency to your PR, in pyproject.toml remove the version:

'disk-objectstore~=1.2' -> 'disk-objectstore'

And add a hook to your fork and branch

[tool.uv.sources]
disk-objectstore = {git = "https://github.com/ahkole/disk-objectstore.git", branch = "XX"}

and for enviroment.yml just get rid of the version:
'disk-objectstore~=1.2' -> 'disk-objectstore'

In the end, using pre-commit run it automatically updates the uv file

@ahkole ahkole requested review from unkcpz and agoscinski as code owners May 1, 2025 18:41
@ahkole
Copy link
Author

ahkole commented May 1, 2025

Thanks a lot @ahkole,

To link the correct dependency to your PR, in pyproject.toml remove the version:

'disk-objectstore~=1.2' -> 'disk-objectstore'

And add a hook to your fork and branch

[tool.uv.sources]
disk-objectstore = {git = "https://github.com/ahkole/disk-objectstore.git", branch = "XX"}

and for enviroment.yml just get rid of the version: 'disk-objectstore~=1.2' -> 'disk-objectstore'

In the end, using pre-commit run it automatically updates the uv file

Ah, I wasn't aware this was possible. Thanks for the suggestion! I think this should be set correctly now.

@ahkole
Copy link
Author

ahkole commented May 1, 2025

Unrelated to this PR, but can I propose btw to update the "Setting up your development environment" page of the Contributor wiki to mention using uv as your Python package manager (for example as a recommendation)? As far as I can see that wiki still mostly mentions pip even though uv is much more convenient (as I have now learned).

@khsrali
Copy link
Contributor

khsrali commented May 6, 2025

Thanks a lot @ahkole,

Changes looks good to me, I run the workflow now to see if tests pass.

@khsrali
Copy link
Contributor

khsrali commented May 6, 2025

Ok, we're hittinh this github bug again,
workflow will not run because I forgot to click on "Approve and run", before pushing 🥲
I do a dirty trick

@khsrali
Copy link
Contributor

khsrali commented May 6, 2025

Good! the trick worked, now let's see the tests..

Copy link

codecov bot commented May 6, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 78.59%. Comparing base (b618cb7) to head (547f8ee).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6847      +/-   ##
==========================================
- Coverage   78.59%   78.59%   -0.00%     
==========================================
  Files         567      567              
  Lines       43092    43093       +1     
==========================================
- Hits        33866    33863       -3     
- Misses       9226     9230       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@khsrali
Copy link
Contributor

khsrali commented May 7, 2025

@ahkole, thanks a gain for your contribution,

One of the tests is failing for sqlite storage backend.

To reproduce run this:
pytest --db-backend sqlite tests/restapi/test_statistics.py
While for psql gonna pass:
pytest --db-backend psql tests/restapi/test_statistics.py

It appears that restapi server has problems to properly read and load the data from a sqlite backend.
To debug, I'd suggest check out this file:
'src/aiida/storage/sqlite_dos/backend.py'

____________________________ test_count_consistency ____________________________
[gw1] linux -- Python 3.9.22 /home/runner/work/aiida-core/aiida-core/.venv/bin/python3

restapi_server = <function restapi_server.<locals>._restapi_server at 0x7f43030aff70>
server_url = <function server_url.<locals>._server_url at 0x7f43002acc10>

    @pytest.mark.usefixtures('populate_restapi_database')
    def test_count_consistency(restapi_server, server_url):
        """Test the consistency in values between full_type_count and statistics"""
        server = restapi_server()
        server_thread = Thread(target=server.serve_forever)
    
        _server_url = server_url(port=server.server_port)
    
        try:
            server_thread.start()
            type_count_response = requests.get(f'{_server_url}/nodes/full_types_count', timeout=10)
            statistics_response = requests.get(f'{_server_url}/nodes/statistics', timeout=10)
        finally:
            server.shutdown()
    
>       type_count_dict = linearize_namespace(type_count_response.json()['data'])

tests/restapi/test_statistics.py:51: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/restapi/test_statistics.py:31: in linearize_namespace
    linearize_namespace(subspace, linear_namespace)
tests/restapi/test_statistics.py:31: in linearize_namespace
    linearize_namespace(subspace, linear_namespace)
tests/restapi/test_statistics.py:31: in linearize_namespace
    linearize_namespace(subspace, linear_namespace)
tests/restapi/test_statistics.py:31: in linearize_namespace
    linearize_namespace(subspace, linear_namespace)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

tree_namespace = {'counter': 5, 'full_type': None, 'label': 'Unregistered', 'namespace': '~no-entry-point~', ...}
linear_namespace = {'data.': 156, 'data.Data.': 20, 'data.core.': [136](https://github.com/aiidateam/aiida-core/actions/runs/14864011669/job/41740571178?pr=6847#step:7:137), 'data.core.array.': 5, ...}

    def linearize_namespace(tree_namespace, linear_namespace=None):
        """Linearize a branched namespace with full_type, count, and subspaces"""
        if linear_namespace is None:
            linear_namespace = {}
    
        full_type = tree_namespace['full_type']
>       while full_type[-1] != '.':
E       TypeError: 'NoneType' object is not subscriptable

tests/restapi/test_statistics.py:23: TypeError

@ahkole
Copy link
Author

ahkole commented May 7, 2025

Hi @khsrali ,

I have tried to run the test as you suggested but on my system the test passes, see below:

> pytest --verbose --db-backend sqlite tests/restapi/test_statistics.py
===================================================================================== test session starts =====================================================================================
platform linux -- Python 3.11.9, pytest-7.4.4, pluggy-1.5.0 -- .venv/bin/python3
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: .
configfile: pyproject.toml
plugins: asyncio-0.16.0, xdist-3.6.1, timeout-2.3.1, datadir-1.5.0, rerunfailures-12.0, cov-4.1.0, regressions-2.7.0, benchmark-4.0.0
timeout: 240.0s
timeout method: thread
timeout func_only: False
collected 1 item                                                                                                                                                                              

tests/restapi/test_statistics.py::test_count_consistency PASSED                                                                                                                         [100%]

====================================================================================== warnings summary =======================================================================================
tests/restapi/test_statistics.py::test_count_consistency
  .venv/lib/python3.11/site-packages/CifFile/CifFile_module.py:356: SyntaxWarning: "is not" with a literal. Did you mean "!="?
    if do_imports is not 'No':

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================== slowest 50 durations =====================================================================================

(3 durations < 1s hidden.  Use -vv to show these durations.)
================================================================================ 1 passed, 1 warning in 1.71s =================================================================================

I seem to be using a different Python version though (3.11 vs 3.9 in the results you shared). Don't know if that can make the difference?

@khsrali
Copy link
Contributor

khsrali commented May 7, 2025

yeah, strange, I tried to run locally with 3.9, also passes for me.
I don't understand why it fails on the CI 🤔

@ahkole
Copy link
Author

ahkole commented May 8, 2025

Is there a way to look more into this type_count_response = requests.get(f'{_server_url}/nodes/full_types_count', timeout=10) request and figure out why full_type if none when this gets executed on the CI host?

@khsrali khsrali requested a review from eimrek May 8, 2025 14:01
@khsrali
Copy link
Contributor

khsrali commented May 8, 2025

Is there a way to look more into this type_count_response = requests.get(f'{_server_url}/nodes/full_types_count', timeout=10) request and figure out why full_type if none when this gets executed on the CI host?

Yes, that would be the way to go when debugging. The pain is that we cannot reproduce it locally, so it's more difficult to understand it and resolve it.

@eimrek, we don't understand why test fails here.
I thought since you are an expert of restspi, maybe you can have a look (?) 😄

@eimrek
Copy link
Member

eimrek commented May 12, 2025

hi! sorry, super busy with other stuff, so i won't be able to take a detailed look.

But this seems to be related to #6747 and #6763. Do you have #6763 included in this PR? maybe if you rebase on main this issue will be fixed?

@ahkole
Copy link
Author

ahkole commented May 12, 2025

@eimrek Thanks for the reply! If I check the history of my branch (https://github.com/ahkole/aiida-core/commits/objectstore-textio/), it does seem to include #6763. It was also last synced with main on May 1st, which was after the merge of #6763 I believe.

@ahkole
Copy link
Author

ahkole commented May 12, 2025

Is it somehow possible to run interactively on the CI host to try to debug the issue?

@khsrali
Copy link
Contributor

khsrali commented May 14, 2025

@ahkole tests pass now 🤷‍♂️
I didn't understand in the end what caused the issue. I added this incident to our flaky test collector.

For me this PR looks good already. Let me have a look at your other PR in diskobjectstore.
Once that's merged, we'll need to make a release. Later we can come back here and merge this.

@ahkole
Copy link
Author

ahkole commented May 14, 2025

@khsrali that's great to hear! Let me know what you think of the disk-objectstore PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

3 participants