parquet-go/parquet-go vs apache/arrow/go/parquet #252

Open
chenquan opened this issue Jan 16, 2025 · 1 comment

Comments

@chenquan

Describe the usage question you have. Please include as many useful details as possible.

Hello, I'm writing a data archiving tool, and I expect to archive data from a trillion-row database. I'm wondering whether I should use parquet-go/parquet-go or apache/arrow/go/parquet to solve my problem?

Component(s)

Parquet

@zeroshade
Member

It honestly depends. Both are released under the Apache License 2.0, so licensing isn't going to be an issue or differ between them.

Ultimately it's going to be a question of Features, Performance and Maintenance.

Features

For example, if you need Parquet encryption, then you should use this library (apache/arrow-go/v18/parquet); to my knowledge, parquet-go/parquet-go doesn't support encryption.

If you need bloom filters, then you'll need to use parquet-go/parquet-go instead, as this library doesn't support bloom filters yet (though I am currently working on that!).

If you are already using Apache Arrow itself for anything (ADBC for database interaction, Flight/FlightSQL as a wire protocol, DuckDB or other Arrow-compatible/native compute engines, etc.), then this library is going to be more performant and beneficial because of its direct integration with Arrow through the pqarrow package. If your data is laid out in Go structs, then parquet-go/parquet-go has a simpler API, as I haven't had the time to enable this library's writers to accept Go structs yet.

I plan to improve the public writer APIs of this library to make better use of generics; parquet-go already has generics-based APIs. And so on.

Performance

As for performance and memory usage, I haven't run any significant benchmarks between the two libraries, so I can't offer a comparison there. I invite you to compare them on your own use case. That said, if you do find that parquet-go is more performant or has better memory usage than this library, please come back and let me know! I'd love to address any performance or memory usage issues you come across, as great pains have been taken to optimize this library as much as possible (just as parquet-go has, with different solutions to some things).
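One way to run such a comparison is a small harness that drives each library through the same adapter and records time and allocations. The following is a minimal sketch using only the Go standard library. `Row`, `RowWriter`, `Factory`, and `newStandIn` are hypothetical names invented for illustration and are not part of either library; the stand-in writer only exists so the harness runs as-is, and you would replace it with thin adapters around the two Parquet writers and with rows shaped like your real data.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
	"testing"
)

// Row is a hypothetical record shape; replace it with your real schema.
type Row struct {
	ID    int64
	Value float64
}

// RowWriter is a hypothetical adapter interface, not part of either library.
// Wrap each Parquet writer you want to compare in a type that satisfies it.
type RowWriter interface {
	WriteRows(rows []Row) error
	Close() error
}

// Factory builds a fresh RowWriter that writes to w.
type Factory func(w io.Writer) RowWriter

// binaryWriter is a stand-in (plain fixed-width encoding, not Parquet)
// so that the harness compiles and runs without either library.
type binaryWriter struct{ w io.Writer }

func (b binaryWriter) WriteRows(rows []Row) error {
	return binary.Write(b.w, binary.LittleEndian, rows)
}

func (b binaryWriter) Close() error { return nil }

func newStandIn(w io.Writer) RowWriter { return binaryWriter{w: w} }

// makeRows generates n synthetic rows.
func makeRows(n int) []Row {
	rows := make([]Row, n)
	for i := range rows {
		rows[i] = Row{ID: int64(i), Value: float64(i) * 0.5}
	}
	return rows
}

// encodedSize writes rows once and returns the number of bytes produced,
// which is useful for comparing output sizes between writers.
func encodedSize(newWriter Factory, rows []Row) (int, error) {
	var buf bytes.Buffer
	w := newWriter(&buf)
	if err := w.WriteRows(rows); err != nil {
		return 0, err
	}
	if err := w.Close(); err != nil {
		return 0, err
	}
	return buf.Len(), nil
}

// measure repeatedly writes rows to an in-memory buffer under
// testing.Benchmark and returns timing and allocation statistics.
func measure(newWriter Factory, rows []Row) testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			if _, err := encodedSize(newWriter, rows); err != nil {
				b.Fatal(err)
			}
		}
	})
}

func main() {
	rows := makeRows(10_000)
	candidates := []struct {
		name string
		f    Factory
	}{
		{"stand-in", newStandIn}, // add one entry per library adapter here
	}
	for _, c := range candidates {
		size, err := encodedSize(c.f, rows)
		if err != nil {
			fmt.Println(c.name, "failed:", err)
			continue
		}
		r := measure(c.f, rows)
		fmt.Printf("%-10s %d bytes  %d ns/op  %d B/op  %d allocs/op\n",
			c.name, size, r.NsPerOp(), r.AllocedBytesPerOp(), r.AllocsPerOp())
	}
}
```

Writing to an in-memory buffer keeps disk I/O out of the measurement; for an archiving workload it is probably also worth repeating the run against real files and with realistic batch sizes, since buffering and row group settings can change the picture.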

Maintenance

Both projects are actively maintained, judging by the frequency of commits. While I can't speak for the maintainers of parquet-go, I can say that this project is highly connected to the Parquet PMC and the Apache community as far as keeping up with changes to the Parquet format, having input into the wider community, and so on. Technically, this library is also considered the official Go library for Parquet.

I hope the above helps you make a decision, or at least gives you a direction for exploration. In the end, I'm interested in what you end up going with and why. In particular, if you do end up going with parquet-go over this library, I would like to know why you made that decision so I can address those issues and make this library better and more desirable 😄

Honestly, thanks for filing this issue!
