
Add ability to save pandas dataframe as different file types #53

Closed
asnaylor opened this issue Jul 8, 2019 · 11 comments · Fixed by #57

asnaylor (Collaborator) commented Jul 8, 2019

Currently fast_curator saves the resulting pandas dataframe as a .csv file. The user should have the option to save the output dataframe in a variety of different file formats.

asnaylor (Collaborator, Author) commented Jul 8, 2019

I was looking through all the ways to save pandas dataframes on their site, and I was wondering: do you want to support them all, @benkrikler? https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html
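For reference, here's a quick sketch of a handful of the writers that guide covers (file names are just illustrative; to_hdf additionally needs the PyTables package):

import pandas as pd

df = pd.DataFrame({'s1': [1.0, 2.0, 3.0]})

df.to_csv('out.csv')                             # plain text
df.to_pickle('out.pkl.gz', compression='gzip')   # gzip-compressed pickle
df.to_hdf('out.h5', key='df')                    # HDF5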

asnaylor (Collaborator, Author) commented Jul 8, 2019

Also, my thought was to have the config yaml file like this:

s1logs2:
    binning:
        - {in: singleScatters.s1Area_phd, out: s1}
    file:
        - {type: ".pkl.compress", compression: "gzip"}

and also to support:

s1logs2:
    binning:
        - {in: singleScatters.s1Area_phd, out: s1}
    file:
        - type: ".h5"

so the user specifies the file type and can also specify other options, like compression etc., that are relevant to saving the pandas dataframe.
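As a rough sketch of how those extra options could be forwarded to pandas (the save_dataframe helper below is hypothetical, not actual fast_curator code):

import pandas as pd

def save_dataframe(df, filename, file_cfg):
    # e.g. file_cfg == {'type': '.pkl.compress', 'compression': 'gzip'}
    cfg = dict(file_cfg)
    file_type = cfg.pop('type')
    if file_type.startswith('.pkl'):
        # everything left in cfg (e.g. compression='gzip') goes to the writer
        df.to_pickle(filename + '.pkl.gz', **cfg)
    else:
        df.to_csv(filename + '.csv', **cfg)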

benkrikler (Member) commented

Hey @asnaylor.

> I was looking through all the ways to save pandas dataframes on their site, and I was wondering: do you want to support them all, @benkrikler? https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

This is a nice idea and I agree that other outputs are worth considering since you're looking at it now anyway. For me, the key points for how we implement this sort of change are: 1) backwards compatibility, so existing YAML configs should still work without changes, and 2) conciseness for simple cases: if you just want to use hdf5 then you shouldn't need to type much else. Similarly, if you want to save to both csv and hdf5 it shouldn't be hard to do that.

  1. Point 1) means we want to add an option for the config files like you're suggesting, though I'd call it file_format or out_file_format maybe (just file seems a bit ambiguous to me).
  2. Point 2) above means it needs to default to producing csv files.
  3. Point 2) above also means we want to support several types of values for this option: a single dictionary with the output file format's config, a list of such dictionaries, possibly a single string for simple formats, like csv.

Examples like yours are helpful, so I'm thinking the following values for file_format should be allowed:

file_format: "csv"
file_format: {type: ".pkl.compress", compression: "gzip"}
file_format: 
  - {type: ".pkl.compress", compression: "gzip"}
  - csv
  - type: hdf
    key: df

To see how this might be implemented, maybe something like the way we handle binned dataframe weight descriptions could work: https://github.com/FAST-HEP/fast-carpenter/blob/master/fast_carpenter/summary/binning_config.py#L80-L90.
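For instance, a minimal sketch of that kind of normalisation (the function below is illustrative, not the actual binning_config code):

def normalise_file_format(value):
    # Point 2: default to csv when nothing is specified
    if value is None:
        value = 'csv'
    # A single string or dict becomes a one-element list
    if not isinstance(value, list):
        value = [value]
    # Bare strings like 'csv' become {'type': 'csv'}
    return [{'type': item} if isinstance(item, str) else item for item in value]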

I have a suspicion that this is leaning into a realm where we should really refactor how the outputs are handled in a more general way than just for the binned dataframe stage. In particular, I'm wondering if this relates to #52. But I think we can perhaps put this in now and then refactor it a bit later when we tackle that issue; that seems the more streamlined way to get there.

asnaylor (Collaborator, Author) commented Jul 8, 2019

> 1. Point 1) means we want to add an option for the config files like you're suggesting, though I'd call it file_format or out_file_format maybe (just file seems a bit ambiguous to me).

Agreed, I think file_format is clear.

> 2. Point 2) above means it needs to default to producing csv files.

Yep, my plan is just to have the default:

file_format: {type: 'csv', ext: '.csv', float_format: '%.17g'}

> 3. Point 2) above also means we want to support several types of values for this option: a single dictionary with the output file format's config, a list of such dictionaries, possibly a single string for simple formats, like csv.

Okay, I'll work on supporting the multiple input options.

I'm thinking the yaml file should require both type and ext, as we can easily support all the different pandas output files with getattr using type:

getattr(output, "to_%s" % self.type)(self.filename + self.ext, **self.kwargs)
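To put that one-liner in context, here's a minimal sketch of how the dispatch could sit inside a saver class (the FileSaver class and its attributes are hypothetical, just following the names in the snippet above):

class FileSaver:
    def __init__(self, filename, type, ext, **kwargs):
        self.filename = filename
        self.type = type        # pandas writer suffix, e.g. "csv", "hdf", "pickle"
        self.ext = ext          # file extension, e.g. ".csv", ".h5"
        self.kwargs = kwargs    # forwarded straight to the pandas writer

    def save(self, output):
        # e.g. type="csv", ext=".csv" -> output.to_csv(filename + ".csv", **kwargs)
        getattr(output, "to_%s" % self.type)(self.filename + self.ext, **self.kwargs)

For example, FileSaver("tbl_s1", "csv", ".csv", float_format="%.17g").save(df) would write tbl_s1.csv for a dataframe df.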

asnaylor (Collaborator, Author) commented Jul 9, 2019

Or maybe, instead of requiring both type and ext in the yaml file, I could just create a dictionary mapping each type to its default ext, so only type would be required in the yaml file.
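A minimal sketch of that mapping (the defaults below are illustrative, not a definitive list):

# Hypothetical defaults pairing each pandas writer with a usual extension
DEFAULT_EXT = {
    'csv': '.csv',
    'hdf': '.h5',
    'pickle': '.pkl',
    'excel': '.xlsx',
    'stata': '.dta',
}

def resolve_ext(type, ext=None):
    # Use an explicit ext if the config gives one, otherwise the default for this type
    return ext if ext is not None else DEFAULT_EXT[type]

So resolve_ext('hdf') returns '.h5', while resolve_ext('pickle', '.pkl.gz') keeps the user's choice.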

benkrikler (Member) commented

Sounds reasonable to me, except that I'm not very sure why we need both type and ext. Wouldn't the file's type normally determine its extension? What do you think?

asnaylor (Collaborator, Author) commented Jul 9, 2019

So ext and type are the same in a lot of cases, but what if the user wants to save in a compressed pickle format? I don't want to hard-code all the different pickle extensions which are supported in pandas.DataFrame.to_pickle.

> Sounds reasonable to me, except that I'm not very sure why we need both type and ext. Wouldn't the file's type normally determine its extension? What do you think?

I think with the dictionary of {'type': 'ext'} you would only need to specify the type and it would get the default ext, but some types have multiple supported file extensions, so having ext allows the user to choose.
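For example, to_pickle already covers several extensions because it infers the compression from the file name (a quick sketch; file names are illustrative):

import pandas as pd

df = pd.DataFrame({'s1': [1.0, 2.0]})

# All of these go through to_pickle; compression is inferred from the extension
df.to_pickle('out.pkl')      # uncompressed
df.to_pickle('out.pkl.gz')   # gzip
df.to_pickle('out.pkl.bz2')  # bz2
df.to_pickle('out.pkl.xz')   # xz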

benkrikler (Member) commented

I think we could in principle calculate the value of ext given the type and the other options given to us, and in general I think it's good practice not to put redundant options in (since that makes it possible to have conflicting configuration). To me, there are three options open to us then:

  1. We only add type, not ext, and then calculate ext ourselves based on the type and other options --- seems like a lot of work, but doable and safest.
  2. We add both type and ext, with ext defaulting to None but overridable. If None, we simply use type as its value. I see this giving us the most flexibility, and with this default the approach is reasonable for most cases, but it means if we ever want to move towards option 1 then we'll potentially break people's configs.
  3. We only add support for type at this point and simply use that directly as the extension. Then later on we can make this more elaborate or, if worst comes to the worst, add in an ext option.

I think the original goal of this issue was just to add hdf5 support, so maybe we just keep it simple for now and use option 3 there? When we refactor things for issue #52 then we can revisit this sort of thing maybe?

asnaylor (Collaborator, Author) commented Jul 9, 2019

I agree, it should be kept simple and keep backwards compatibility to prevent breaking people's configs. However, if we just have type as the direct file extension, we're going to have to figure out which pandas command to execute to save the file, as I don't think there is a global save-file function. For example, if the user specified type: ".h5" then we would have to use something like a dictionary (the reverse of 1.) to get .to_hdf() to save the file.

asnaylor (Collaborator, Author) commented

I gave option 3 a little more thought and it's not too crazy to do.

This snippet of code should allow the user to specify just the file extension in type; it checks whether the extension maps directly to a pandas function and, if not, replaces save_func with the correct pandas function, unless the extension is not supported. This also allows users to write to extensions with compression, such as .pkl.gzip:

# self.type is an extension such as ".h5" or ".pkl.gzip"; take its first component
save_func = self.type.split('.')[1]

# Extensions whose pandas writer suffix differs from the extension itself
valid_ext = {'xlsx': 'excel', 'h5': 'hdf', 'msg': 'msgpack',
             'dta': 'stata', 'pkl': 'pickle', 'p': 'pickle'}
if save_func in valid_ext:
    save_func = valid_ext[save_func]

try:
    # e.g. save_func == "hdf" dispatches to output.to_hdf(...)
    getattr(output, "to_%s" % save_func)(self.filename, **self.kwargs)
except AttributeError as err:
    print("Incorrect file format: %s" % err)

asnaylor (Collaborator, Author) commented

@benkrikler I have coded option 3 on my branch https://github.com/asnaylor/fast-carpenter but I am unable to execute fast_carpenter properly when I build it from source (#54), so I've been unable to fully test the code. I tested the code in parts, but not all together in fast_carpenter.
