[Go][Parquet] Writing a Parquet file from a slice of structs #44
Comments
So, first and foremost: You're completely right, there isn't currently a good / efficient way to convert a slice of structs to an arrow record / struct array. My initial reaction would be to suggest converting the structs to JSON and then using The most "ergonomic" way to do this would likely be to bypass the conversion to arrow in the first place and just use the column chunk writers directly from the That all said, it would probably be pretty useful if we did implement a full reflection-based way of converting a struct to an arrow schema (like already exists for converting a struct to a parquet schema) or instantiate a
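To make the reflection idea above a little more concrete, here is a minimal standard-library sketch of the transposition step: it walks a slice of structs and collects one column of values per exported field, which is roughly the shape that column-oriented writers or builders consume. The names `column`, `columnsFromSlice`, and `reading` are hypothetical, and nothing here calls the Arrow or Parquet packages. Mapping each Go kind to an Arrow or Parquet type, and handling nested structs, pointers, and struct tags, is deliberately left out.

```go
package main

import (
	"fmt"
	"reflect"
)

// column is a hypothetical holder for one field's name, Go kind, and values.
type column struct {
	Name   string
	Kind   reflect.Kind
	Values []any
}

// columnsFromSlice transposes a slice of structs into one column per exported
// field. It is a stand-in for what a reflection-based builder would do and
// does not use any Arrow or Parquet API.
func columnsFromSlice(rows any) ([]column, error) {
	v := reflect.ValueOf(rows)
	if v.Kind() != reflect.Slice || v.Type().Elem().Kind() != reflect.Struct {
		return nil, fmt.Errorf("expected a slice of structs, got %T", rows)
	}
	t := v.Type().Elem()
	var cols []column
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		if !f.IsExported() {
			continue // unexported fields cannot be read through Interface()
		}
		c := column{Name: f.Name, Kind: f.Type.Kind(), Values: make([]any, 0, v.Len())}
		for r := 0; r < v.Len(); r++ {
			c.Values = append(c.Values, v.Index(r).Field(i).Interface())
		}
		cols = append(cols, c)
	}
	return cols, nil
}

// reading is a hypothetical row type used only for this example.
type reading struct {
	Sensor string
	Value  float64
}

func main() {
	cols, err := columnsFromSlice([]reading{{"a", 1.5}, {"b", 2.5}})
	if err != nil {
		panic(err)
	}
	for _, c := range cols {
		fmt.Println(c.Name, c.Kind, c.Values)
	}
}
```

A real implementation would presumably switch on `Kind` to pick a typed builder or column writer rather than storing values as `any`; the sketch only shows the row-to-column walk.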
@tschaub I have an initial version of this using reflection here, in case this is helpful: I'm using this for a specific use case so some conversions may not have been tested; feel free to let me know if anything doesn't work.
Looks useful, @chelseajonesr. My only real current use case has been to create Parquet data for tests. I've written a So while I think it could be useful to have something in this library to generate Arrow data from a slice of structs (to complement the current
I found this issue while trying to figure out how to write out a Parquet file based on a Go struct as well. In my case I already have some data in an Arrow data structure which I have been able to write out via the As a comparison the https://github.com/xitongsys/parquet-go package allows you to basically do:
I was trying to see if there was something like that present in
@eest You're correct that the Issues I had encountered with https://github.com/xitongsys/parquet-go were actually the motivating factor that led me to create the parquet package here, and they are why I created the initial methods that let you create a parquet schema from a struct and vice versa. Having a writer that lets you write instances of a struct was out of scope when I was originally building this package, but given the nature of Go it does seem like a reasonable addition. I just never did it because I didn't see any interest in it until now. If either of you would be willing to put together a PR adding this functionality, I'd happily help iterate on it with you. Unfortunately, I don't currently have the bandwidth to work on it myself.
I could also see @chelseajonesr's repo as a good starting point for this kind of functionality, and I would happily accept a refined version into the Arrow lib itself if a PR is made.
@zeroshade Sure, I'd be happy to. I'll also need to fill in a few data types I left out and expand the tests; I should be able to start on that next week.
Describe the usage question you have. Please include as many useful details as possible.
I'm hoping to get suggestions on the best way to use the library to write a Parquet file given a slice of structs (Golang structs instead of Arrow's array.Struct).
The `parquet.NewSchemaFromStruct()` function looks like a useful starting point to generate a Parquet schema from a struct.
The `pqarrow.NewFileWriter()` function is helpful for creating a writer. And I can see how to convert a Parquet schema to an Arrow schema with the `pqarrow.FromParquet()` function.
The `writer.WriteBuffered()` method looks like a convenient way to write an Arrow record. So the gap is then to get from a slice of structs to the Arrow record.
I was looking for something like `array.RecordFromSlice()`. The `array.RecordFromStructArray()` looks useful, but I think I would have to do a fair bit of reflection to work with the struct builder. It looks like `array.RecordFromJSON()` does the same sort of reflection that I would have to do to use the struct builder.
I know it is not efficient, but I see that I can encode my struct slice as JSON and then generate a record from that. Here is a working test that uses the `pqarrow.FileWriter` to write a slice of structs as Parquet:
Again, I know there are more efficient ways to go from a slice of structs to a Parquet file. I'm just looking for advice on the most "ergonomic" way to use this library to do that. Am I missing a way to construct an Arrow record from a slice of structs? Or should I not be using the `pqarrow` package at all to do this?
Component(s)
Go, Parquet
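For readers following the JSON route described in the question, the first half of that round trip can be sketched with the standard library alone: encode the slice of structs as a single JSON array of objects held in a buffer. This is not the test from the original post, and the `row` type and `encodeRows` helper are hypothetical. The resulting buffer is the kind of reader one would then pass, together with an Arrow schema, to `array.RecordFromJSON()`; that call is intentionally left out so the example stays free of external dependencies.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// row is a hypothetical record type; the json tags stand in for column names
// and are assumed to match the field names of the target schema.
type row struct {
	Name  string  `json:"name"`
	Score float64 `json:"score"`
}

// encodeRows serializes rows as one JSON array of objects. The returned buffer
// implements io.Reader, so it can be handed to a JSON-based record reader.
func encodeRows(rows []row) (*bytes.Buffer, error) {
	var buf bytes.Buffer
	if err := json.NewEncoder(&buf).Encode(rows); err != nil {
		return nil, err
	}
	return &buf, nil
}

func main() {
	buf, err := encodeRows([]row{{"a", 1}, {"b", 2.5}})
	if err != nil {
		panic(err)
	}
	fmt.Print(buf.String())
}
```

As the question already notes, this trades efficiency for convenience: every value is serialized to text and parsed again before it reaches a columnar builder.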