[Go][Parquet] Writing a Parquet file from a slice of structs #44

Open
tschaub opened this issue Sep 20, 2023 · 7 comments
@tschaub
Contributor

tschaub commented Sep 20, 2023

Describe the usage question you have. Please include as many useful details as possible.

I'm hoping to get suggestions on the best way to use the library to write a Parquet file given a slice of structs (Golang structs instead of Arrow's array.Struct).

The parquet.NewSchemaFromStruct() function looks like a useful starting point to generate a Parquet schema from a struct.

The pqarrow.NewFileWriter() function is helpful for creating a writer. And I can see how to convert a Parquet schema to an Arrow schema with the pqarrow.FromParquet() function.

The writer.WriteBuffered() method looks like a convenient way to write an Arrow record. So the gap is then to get from a slice of structs to the Arrow record.

I was looking for something like array.RecordFromSlice(). array.RecordFromStructArray() looks useful, but I think I would have to do a fair bit of reflection to work with the struct builder. It looks like array.RecordFromJSON() does the same sort of reflection that I would have to do to use the struct builder.
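For a fixed struct, the non-reflective version with array.NewRecordBuilder is simple enough; the reflection would only be needed to generalize it to arbitrary structs. A minimal sketch for the Row type used in the test below (assuming its fields come out of the schema as name: utf8 and count: int64, in that order):

import (
	"github.com/apache/arrow/go/v14/arrow"
	"github.com/apache/arrow/go/v14/arrow/array"
	"github.com/apache/arrow/go/v14/arrow/memory"
)

// recordFromRows is a hypothetical helper that appends each struct field to
// the matching column builder by hand. Generalizing this to arbitrary struct
// types is the reflection I'd have to write.
func recordFromRows(arrowSchema *arrow.Schema, rows []*Row) arrow.Record {
	builder := array.NewRecordBuilder(memory.DefaultAllocator, arrowSchema)
	defer builder.Release()

	for _, row := range rows {
		builder.Field(0).(*array.StringBuilder).Append(row.Name)        // name: utf8
		builder.Field(1).(*array.Int64Builder).Append(int64(row.Count)) // count: int64
	}

	return builder.NewRecord() // caller is responsible for calling Release
}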

I know it is not efficient, but I see that I can encode my struct slice as JSON and then generate a record from that. Here is a working test that uses the pqarrow.FileWriter to write a slice of structs as Parquet:

package pqarrow_test

import (
	"bytes"
	"encoding/json"
	"strings"
	"testing"

	"github.com/apache/arrow/go/v14/arrow/array"
	"github.com/apache/arrow/go/v14/arrow/memory"
	"github.com/apache/arrow/go/v14/parquet"
	"github.com/apache/arrow/go/v14/parquet/pqarrow"
	"github.com/apache/arrow/go/v14/parquet/schema"
	"github.com/stretchr/testify/require"
)

func TestFileWriterFromStructSlice(t *testing.T) {
	type Row struct {
		Name  string `parquet:"name=name, logical=String" json:"name"`
		Count int    `parquet:"name=count" json:"count"`
	}

	rows := []*Row{
		{
			Name:  "row-1",
			Count: 42,
		},
		{
			Name:  "row-2",
			Count: 100,
		},
	}

	data, err := json.Marshal(rows)
	require.NoError(t, err)

	parquetSchema, err := schema.NewSchemaFromStruct(rows[0])
	require.NoError(t, err)

	arrowSchema, err := pqarrow.FromParquet(parquetSchema, nil, nil)
	require.NoError(t, err)

	rec, _, err := array.RecordFromJSON(memory.DefaultAllocator, arrowSchema, strings.NewReader(string(data)))
	require.NoError(t, err)

	output := &bytes.Buffer{}

	writer, err := pqarrow.NewFileWriter(arrowSchema, output, parquet.NewWriterProperties(), pqarrow.DefaultWriterProps())
	require.NoError(t, err)

	require.NoError(t, writer.WriteBuffered(rec))
	require.NoError(t, writer.Close())
}

Again, I know there are more efficient ways to go from a slice of structs to a Parquet file. I'm just looking for advice on the most "ergonomic" way to use this library to do that. Am I missing a way to construct an Arrow record from a slice of structs? Or should I not be using the pqarrow package at all to do this?

Component(s)

Go, Parquet

@tschaub tschaub changed the title Writing a Parquet file from a slice of structs [Go][Parquet] Writing a Parquet file from a slice of structs Sep 21, 2023
@zeroshade
Member

So, first and foremost: you're completely right, there isn't currently a good / efficient way to convert a slice of structs to an Arrow record / struct array. My initial reaction would be to suggest converting the structs to JSON and then using RecordFromJSON, but you'd still have to write the reflection to generate the Arrow schema yourself (all of the FromJSON methods require an existing Arrow schema; I never implemented reflection to generate one). And you're right that this would be even less efficient.

The most "ergonomic" way to do this would likely be to bypass the conversion to Arrow in the first place and just use the column chunk writers directly from the file package.
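Roughly like this (an untested sketch, assuming two required columns that map to BYTE_ARRAY and INT64, written as a single row group):

package example

import (
	"os"

	"github.com/apache/arrow/go/v14/parquet"
	"github.com/apache/arrow/go/v14/parquet/file"
	"github.com/apache/arrow/go/v14/parquet/schema"
)

type Row struct {
	Name  string `parquet:"name=name, logical=String"`
	Count int64  `parquet:"name=count"` // int64 so the physical type is unambiguously INT64
}

func writeRows(rows []*Row, out *os.File) error {
	sc, err := schema.NewSchemaFromStruct(Row{})
	if err != nil {
		return err
	}

	writer := file.NewParquetWriter(out, sc.Root())
	rg := writer.AppendRowGroup()

	// Column 0: name (BYTE_ARRAY). Required columns need no def/rep levels.
	cw, err := rg.NextColumn()
	if err != nil {
		return err
	}
	names := make([]parquet.ByteArray, len(rows))
	for i, r := range rows {
		names[i] = parquet.ByteArray(r.Name)
	}
	if _, err := cw.(*file.ByteArrayColumnChunkWriter).WriteBatch(names, nil, nil); err != nil {
		return err
	}

	// Column 1: count (INT64).
	cw, err = rg.NextColumn()
	if err != nil {
		return err
	}
	counts := make([]int64, len(rows))
	for i, r := range rows {
		counts[i] = r.Count
	}
	if _, err := cw.(*file.Int64ColumnChunkWriter).WriteBatch(counts, nil, nil); err != nil {
		return err
	}

	return writer.Close()
}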

That all said, it would probably be pretty useful if we did implement a full reflection-based way of converting a struct to an Arrow schema (like what already exists for converting a struct to a Parquet schema), or to instantiate a RecordBuilder or StructBuilder from a struct and then allow appending a slice of that struct to the builder.
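For the schema half, the reflection would look something like this toy sketch (it only handles string and int fields and assumes lower-cased field names; a real version would need to cover the remaining kinds, nested structs, slices, struct tags, and nullability):

package example

import (
	"fmt"
	"reflect"
	"strings"

	"github.com/apache/arrow/go/v14/arrow"
)

// schemaFromStruct maps a flat struct value to an Arrow schema via reflection.
func schemaFromStruct(v interface{}) (*arrow.Schema, error) {
	t := reflect.TypeOf(v)
	if t.Kind() != reflect.Struct {
		return nil, fmt.Errorf("expected a struct, got %s", t.Kind())
	}

	fields := make([]arrow.Field, 0, t.NumField())
	for i := 0; i < t.NumField(); i++ {
		f := t.Field(i)
		var dt arrow.DataType
		switch f.Type.Kind() {
		case reflect.String:
			dt = arrow.BinaryTypes.String
		case reflect.Int, reflect.Int64:
			dt = arrow.PrimitiveTypes.Int64
		default:
			return nil, fmt.Errorf("unhandled field kind %s", f.Type.Kind())
		}
		// Assumption: column names are just the lower-cased Go field names;
		// a real version would respect struct tags.
		fields = append(fields, arrow.Field{Name: strings.ToLower(f.Name), Type: dt})
	}
	return arrow.NewSchema(fields, nil), nil
}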

@chelseajonesr
Contributor

@tschaub I have an initial version of this using reflection here, in case this is helpful:
https://github.com/chelseajonesr/rfarrow

I'm using this for a specific use case so some conversions may not have been tested; feel free to let me know if anything doesn't work.

@tschaub
Contributor Author

tschaub commented Oct 4, 2023

Looks useful, @chelseajonesr.

My only real current use case has been creating Parquet data for tests. I've written a test.ParquetFromJSON() function for this purpose. Maybe also specific to my use case, but this relies on incrementally building up a schema based on a configurable number of input (JSON) rows, to allow for cases where nulls may be present in early rows and the appropriate field type isn't known until more data is read. So I have an Arrow schema builder for this. It doesn't yet cover all the types you might encounter with an arbitrary struct; I'm just adding support for the cases I need to handle.

So while I think it could be useful to have something in this library to generate Arrow data from a slice of structs (to complement the current parquet.NewSchemaFromStruct() function), I just wanted to say that I don't have an urgent need for this now. I'll close this unless someone else thinks it is a worthwhile issue to keep open.

@eest

eest commented Oct 5, 2023

I found this issue while trying to figure out how to write a Parquet file based on a Go struct as well. In my case I already have some data in an Arrow data structure, which I have been able to write out via the pqarrow package, but then I also needed to write out some other data where I am not using Arrow data structures for storage. Given the separation between the pqarrow convenience package and the more general file package, I was not sure whether the latter aims to be a "general" Parquet writer for use even when Arrow is not used for the input data.

As a comparison the https://github.com/xitongsys/parquet-go package allows you to basically do:

pw, err := writer.NewParquetWriterFromWriter(..., new(myStruct), ...)
pw.Write(myStructInstance)

I was trying to see if there was something like that in file but was not able to find it. Of course, it may be that what I'm looking for is out of scope for this project and I should just use another package for that, like the one above.

@zeroshade
Member

@eest You're correct that the file package is intended to be a "general" parquet writer for use even if arrow is not used for the input data. The idea was that all "arrow" specific things would be contained in the pqarrow package, while the rest are general parquet packages.

Issues I had encountered with https://github.com/xitongsys/parquet-go were actually the motivating factor that led me to create the parquet package here, and are why I created the initial methods that let you create a parquet schema from a struct and vice versa. Having a writer that allows you to write instances of a struct was out of scope when I originally built this package, but given the nature of Go it does seem like a reasonable addition. I just never did it because I didn't see any interest in it until now.

If either of you would be willing to put together a PR adding this functionality, I'd happily help iterate on it with you. Unfortunately, I don't currently have the bandwidth to work on adding it myself.

@zeroshade
Member

I could also see @chelseajonesr's repo as a good starting point for this kind of functionality, and I would happily accept a refined version into the Arrow lib itself if a PR is made.

@chelseajonesr
Contributor

@zeroshade Sure, I'd be happy to - will also need to fill in a few data types I left out and expand the tests; should be able to start on that next week.

@assignUser assignUser transferred this issue from apache/arrow Aug 30, 2024