
Support for chunksize #17

Open · mohr023 opened this issue Feb 21, 2020 · 2 comments


mohr023 commented Feb 21, 2020

Some of the SIA files, such as PA##.dbc, consume roughly 20 GB of RAM when loaded directly into a dataframe.

Should we support a chunksize parameter, like the one pandas readers such as read_csv expose, so these files can be processed in pieces? If so, do you see any caveats with this approach, @fccoelho?
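For reference, this is the pattern being asked for: pandas readers such as read_csv accept a chunksize argument and return an iterator of DataFrames instead of one large frame. A minimal illustration of that existing pandas behavior (the file name, chunk size, and per-chunk processing here are placeholders, not anything from this project):

    import pandas as pd

    total = 0
    # With chunksize set, read_csv returns an iterator of DataFrames,
    # so only one chunk is held in memory at a time.
    for chunk in pd.read_csv("PA_sample.csv", chunksize=100_000):
        total += len(chunk)  # stand-in for real per-chunk processing
    print(total)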

mohr023 (Author) commented Feb 21, 2020

One alternative I see for implementing this is returning a generator of DataFrames, such as:

    records = iter(testdbf.records)  # one shared iterator; re-creating it per element would always yield the first record
    chunks = (pd.DataFrame(list(islice(records, chunksize))) for i in range(0, len(testdbf), chunksize))

For this test, I'm using the DBF object from dbfread, with itertools.islice taking chunksize records at a time (the last chunk may be shorter).
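Fleshing that idea out into a self-contained sketch, assuming only the dbfread and pandas APIs (the function name, file name, and chunk sizes below are illustrative, not part of this project):

    from itertools import islice

    import pandas as pd
    from dbfread import DBF

    def dataframe_chunks(path, chunksize=100_000):
        """Yield DataFrames of at most `chunksize` records from a DBF file."""
        dbf = DBF(path)
        records = iter(dbf)  # single pass over the file
        # ceil(len(dbf) / chunksize) iterations; islice drains chunksize
        # records per iteration, with the final chunk possibly shorter.
        for _ in range(0, len(dbf), chunksize):
            yield pd.DataFrame(list(islice(records, chunksize)))

    # Example: process one chunk at a time instead of the whole file.
    for chunk in dataframe_chunks("PA_sample.dbf", chunksize=50_000):
        print(chunk.shape)

Each chunk is materialized only when requested, so peak memory is bounded by one chunk rather than the whole table.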

fccoelho (Collaborator) commented

This is a real problem, @mohr023. If we iterate over the DBF records as we read them, we would also need to iterate over them when saving the cache file, and we could no longer return the full dataframe after downloading.

If you have a good idea for solving this, feel free to submit a pull request.
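One possible shape for reconciling chunked reads with the cache file, purely as a sketch: stream chunks straight to disk as they are read, then let callers choose between loading the cache whole or iterating it lazily. The cache_chunks helper below is hypothetical, and CSV is used only because pandas supports appending to it natively; it is an assumption, not the project's actual cache format.

    import pandas as pd

    def cache_chunks(chunks, cache_path):
        """Write an iterable of DataFrames to a single CSV cache, chunk by chunk."""
        first = True
        for chunk in chunks:
            # Write the header once, then append subsequent chunks.
            chunk.to_csv(cache_path, mode="w" if first else "a",
                         header=first, index=False)
            first = False

    # Callers who still want the full dataframe can pay the memory cost explicitly:
    #     df = pd.read_csv(cache_path)
    # while memory-constrained callers iterate the cache lazily:
    #     for chunk in pd.read_csv(cache_path, chunksize=100_000): ...

This keeps the download-and-cache step at bounded memory, while "return the full dataframe" becomes an opt-in read of the cache rather than something held in RAM throughout.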
