Add ability to save pandas dataframe as different file types #53
Comments
I was looking through all the ways to save pandas dataframes on their site and I was wondering: do you want to support them all, @benkrikler? https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html |
Also my thoughts were having the config yaml file like this:
and support:
so the user specifies the file type and can also specify other options, such as compression, that are relevant to saving the pandas dataframe |
Hey @asnaylor.
This is a nice idea and I agree that other outputs are worth considering since you're looking at it now anyway. For me, the key points for how we implement this sort of change are: 1) backwards compatibility, so existing YAML configs should still work without changes; 2) conciseness for simple cases: if you just want to use hdf5 then you shouldn't need to type much else. Similarly, if you want to save to both csv and hdf5 it shouldn't be hard to do that.
Examples like yours are helpful, so I'm thinking the following values for file_format:
    file_format: "csv"
    file_format: {type: ".pkl.compress", compression: "gzip"}
    file_format:
      - {type: ".pkl.compress", compression: "gzip"}
      - csv
      - type: hdf
        key: df
To see how this might be implemented, maybe something like the way we handle binned dataframe weight descriptions could work: https://github.com/FAST-HEP/fast-carpenter/blob/master/fast_carpenter/summary/binning_config.py#L80-L90. I have a suspicion that this is leaning into a realm where we should really refactor how the outputs are handled in a more general way than just for the binned dataframe stage. In particular, I'm wondering if this would relate to #52. But I think we can perhaps put this in now and then refactor it a bit later when we tackle that issue, as the more streamlined way to get there. |
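The three shapes of file_format proposed above (a bare string, a single mapping, or a list mixing both) could be reduced to one internal form before any writing happens. The following is only a minimal sketch of that idea: the function name `normalise_file_format` and the choice of "csv" as the fallback are assumptions made for illustration, not code from fast-carpenter.

```python
def normalise_file_format(file_format="csv"):
    """Return a list of dicts, one per requested output, each with a "type" key.

    Hypothetical helper: accepts a string, a dict, or a list of either.
    """
    # A bare string or a single dict is treated as a one-element list.
    if isinstance(file_format, (str, dict)):
        file_format = [file_format]

    outputs = []
    for entry in file_format:
        if isinstance(entry, str):
            # "csv" is shorthand for {"type": "csv"}
            entry = {"type": entry}
        elif not isinstance(entry, dict) or "type" not in entry:
            raise ValueError("Bad file_format entry: %r" % (entry,))
        # Copy so the caller's config is not modified later on.
        outputs.append(dict(entry))
    return outputs
```

With this shape, an existing config that never mentions file_format keeps producing a single csv output, and the simple case stays a one-word setting.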
Agreed, I think
Yep, my plan is just to have the default:
    file_format: {type: 'csv', ext: '.csv', float_format: '%.17g'}
Okay, I'll work on having the multiple input options.
I think the yaml file should require both
    getattr(output, "to_%s" % self.type)(self.filename + self.ext, **self.kwargs) |
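As a rough illustration of the getattr dispatch in the line above, here is a small self-contained sketch. The names `save_output` and `FakeFrame` are hypothetical, and falling back to "." plus the type when no ext is given is an assumption of this sketch rather than something decided in the thread. The stand-in class is used so the example runs without pandas; a real DataFrame exposes the same kind of `to_*` methods.

```python
def save_output(output, filename, spec):
    """Write `output` using the "to_<type>" method named in `spec`.

    `spec` is a dict such as {"type": "csv", "ext": ".csv", ...}; every key
    other than "type" and "ext" is forwarded to the writer as a keyword.
    """
    options = dict(spec)  # copy so the config itself is left untouched
    file_type = options.pop("type")
    # Assumption: default the extension to ".<type>" when it is omitted.
    ext = options.pop("ext", "." + file_type)

    writer = getattr(output, "to_%s" % file_type, None)
    if writer is None:
        raise ValueError("Unsupported output type: %r" % file_type)

    path = filename + ext
    writer(path, **options)
    return path


class FakeFrame:
    """Stand-in for a dataframe; it only records the call it receives."""

    def to_csv(self, path, **kwargs):
        self.call = (path, kwargs)


frame = FakeFrame()
save_output(frame, "tbl", {"type": "csv", "ext": ".csv", "float_format": "%.17g"})
```

Looking the writer up with a default of None lets an unknown type fail with a clear message before any file is touched.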
Or maybe instead of requiring both |
Sounds reasonable to me, except that I'm not very sure why we need both |
So
I think with the dictionary of |
I think we could in principle calculate the value of
I think the original goal of this issue was just to add hdf5 support, so maybe we just keep it simple for now and use option 3 there? When we refactor things for issue #52 we can maybe revisit this sort of thing? |
I agree, it should be kept simple and stay backwards compatible to avoid breaking people's configs. However, if we just have |
I gave option 3 a little more thought and it's not too crazy to do. This snippet of code should allow the user to specify just the file extension in
    # Take the part after the leading dot, e.g. ".h5" -> "h5"
    save_func = self.type.split('.')[1]
    # Extensions whose name differs from the pandas "to_*" method suffix
    valid_ext = {'xlsx': 'excel', 'h5': 'hdf', 'msg': 'msgpack', 'dta': 'stata', 'pkl': 'pickle', 'p': 'pickle'}
    if save_func in valid_ext:
        save_func = valid_ext[save_func]
    try:
        getattr(output, "to_%s" % save_func)(self.filename, **self.kwargs)
    except AttributeError as err:
        print("Incorrect file format: %s" % err) |
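One thing worth noting about the snippet above is that `split('.')[1]` raises an IndexError for a value with no dot, such as "csv". A slightly more forgiving lookup might look like the sketch below. The function name `writer_name` is hypothetical, and taking the first non-empty dot-separated component is an assumption chosen so that ".h5", "csv" and ".pkl.compress" are all handled; the mapping itself is copied from the snippet.

```python
# Extension -> suffix of the pandas "to_*" writer, copied from the snippet above.
VALID_EXT = {"xlsx": "excel", "h5": "hdf", "msg": "msgpack",
             "dta": "stata", "pkl": "pickle", "p": "pickle"}


def writer_name(file_type):
    """Map a value such as ".h5", "csv" or ".pkl.compress" to a "to_*" name."""
    # Drop empty pieces so a leading dot is optional.
    parts = [part for part in file_type.split(".") if part]
    if not parts:
        raise ValueError("Empty file type: %r" % file_type)
    ext = parts[0]
    # Extensions without an entry are assumed to match the writer suffix already.
    return "to_" + VALID_EXT.get(ext, ext)
```

The returned name still has to be checked against the dataframe with getattr, since not every name exists in every pandas version: `to_msgpack`, for instance, was removed in pandas 1.0.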
@benkrikler I have coded option 3 on my branch https://github.com/asnaylor/fast-carpenter but I am unable to execute |
Currently fast_curator saves the resulting pandas dataframe as a .csv file. The user should have the option to save the output dataframe in a variety of different file formats.