I'm proficient in other languages but not very familiar with Rust, and I want to learn it. My use-case is likely a good starting point for using existing code to add functionality. I'll use https://doc.rust-lang.org/stable/book/ as a reference for coding standards. What I'm asking for is gotchas or advice on best practices for where I'm trying to go with this.
I'd like to be able to add simple filters to src/bin/parquet-rewrite.rs so that I can perhaps do either from
(In)equality is my first "need", since I'm trying to remove specific IDs from one field. I'd love to be able to support a set of simple/basic filters and, if it's easy, I'll extend into sets, missingness, etc.
basics: ==, !=, >, >=, <, and <= (recognizing that == and numeric can run into IEEE-754 issues)
boolean: Field and ! Field, though this may be easier as Field == true or Field == 1?
sets: Field in ('aa','bb','cc') and Field not in (..)
missingness: Field is not null (I'm not sure whether Field != null works)
logical operators: "and" is the default between multiple conditions; I'd also like to support "or"
column presence: really only useful in a generic sense when combined with another condition, such as Field not exist or Field > 5
(bigger stretch) paren grouping: Field1 != 'abc' or (Field2 == 'xyz' and Field3 > 100)
(Perhaps saying "SQL-like filter" might be sufficient for many things, I'm sure I'm missing something in that comparison :-)
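To make the wish list above concrete, here is a minimal sketch of how such filters could be represented as a small expression tree and evaluated against one row. Everything in it (`Value`, `CmpOp`, `Expr`, `Row`, `eval`, `example_expr`) is a hypothetical name chosen for illustration, not an existing arrow-rs or parquet API. It also assumes a simplified two-valued logic in which a comparison against a null cell, a missing column, or a value of a different type is simply false, which is not the three-valued logic SQL uses.

```rust
use std::collections::HashMap;

// Hypothetical cell value; a real implementation would map from arrow types.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
    Bool(bool),
}

#[derive(Debug, Clone, Copy)]
enum CmpOp { Eq, Ne, Gt, Ge, Lt, Le }

// One node of the filter expression tree.
#[derive(Debug, Clone)]
enum Expr {
    Cmp { field: String, op: CmpOp, value: Value },
    In { field: String, set: Vec<Value>, negated: bool },
    IsNull { field: String, negated: bool },
    Exists { field: String },
    Not(Box<Expr>),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}

// Column name -> cell. `None` is a null cell; an absent key is a missing column.
type Row = HashMap<String, Option<Value>>;

// Compare two values of the same type; mismatched types never match here.
fn compare(op: CmpOp, left: &Value, right: &Value) -> bool {
    use std::cmp::Ordering::*;
    let ord = match (left, right) {
        (Value::Int(a), Value::Int(b)) => a.cmp(b),
        (Value::Str(a), Value::Str(b)) => a.cmp(b),
        (Value::Bool(a), Value::Bool(b)) => a.cmp(b),
        _ => return false,
    };
    match op {
        CmpOp::Eq => ord == Equal,
        CmpOp::Ne => ord != Equal,
        CmpOp::Gt => ord == Greater,
        CmpOp::Ge => ord != Less,
        CmpOp::Lt => ord == Less,
        CmpOp::Le => ord != Greater,
    }
}

// Evaluate the expression tree against a single row.
fn eval(expr: &Expr, row: &Row) -> bool {
    match expr {
        Expr::Cmp { field, op, value } => match row.get(field) {
            Some(Some(cell)) => compare(*op, cell, value),
            _ => false, // null cell or missing column
        },
        Expr::In { field, set, negated } => match row.get(field) {
            Some(Some(cell)) => set.contains(cell) != *negated,
            _ => false,
        },
        Expr::IsNull { field, negated } => match row.get(field) {
            Some(cell) => cell.is_none() != *negated,
            None => false, // column not present at all
        },
        Expr::Exists { field } => row.contains_key(field),
        Expr::Not(inner) => !eval(inner, row),
        Expr::And(a, b) => eval(a, row) && eval(b, row),
        Expr::Or(a, b) => eval(a, row) || eval(b, row),
    }
}

// Field1 != 'abc' or (Field2 == 'xyz' and Field3 > 100)
fn example_expr() -> Expr {
    let f1 = Expr::Cmp { field: "Field1".into(), op: CmpOp::Ne, value: Value::Str("abc".into()) };
    let f2 = Expr::Cmp { field: "Field2".into(), op: CmpOp::Eq, value: Value::Str("xyz".into()) };
    let f3 = Expr::Cmp { field: "Field3".into(), op: CmpOp::Gt, value: Value::Int(100) };
    Expr::Or(Box::new(f1), Box::new(Expr::And(Box::new(f2), Box::new(f3))))
}
```

Parsing the SQL-like text into this tree is a separate step that the sketch leaves out. Keeping the tree separate from the parser means the evaluator could be tested and reused regardless of which parsing approach or dependency ends up being chosen.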
I am not familiar with the Rust ecosystem; if bringing in another dependency to easily support this (such as parsing of my SQL-like code above) is required, I'll learn that.
Ultimately, I'm hoping for advice from experienced arrow/parquet users/rustaceans along these lines:
I've seen issues about "filter pushdown". Is that a good place to start looking for adding this capability to parquet-rewrite?
When comparing different types of data (string vs number), some languages are quite permissive (auto-casting, though not without logical errors), while others complain or dump core. What community best practices should I be using to guard against problems here?
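One possible guard, sketched below under stated assumptions, is to check each filter literal against the column's declared type once, before any rows are read, and to return an error instead of casting silently. The names `ColumnType`, `Literal`, `FilterError`, and `check_literal` are hypothetical, and the schema is modeled as a plain slice of (name, type) pairs rather than an arrow schema. The single widening it allows (an integer literal against a float column) is an illustrative choice, not an established convention.

```rust
// Hypothetical, simplified stand-in for a column's logical type.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType { Int64, Float64, Utf8, Boolean }

// A literal as it appears on the right-hand side of a filter.
#[derive(Debug, Clone, PartialEq)]
enum Literal {
    Int(i64),
    Float(f64),
    Str(String),
    Bool(bool),
}

#[derive(Debug, PartialEq)]
enum FilterError {
    UnknownColumn(String),
    TypeMismatch { column: String, expected: ColumnType, found: ColumnType },
}

impl Literal {
    // The column type this literal naturally corresponds to.
    fn column_type(&self) -> ColumnType {
        match self {
            Literal::Int(_) => ColumnType::Int64,
            Literal::Float(_) => ColumnType::Float64,
            Literal::Str(_) => ColumnType::Utf8,
            Literal::Bool(_) => ColumnType::Boolean,
        }
    }
}

// Validate one `column <op> literal` pair against the schema up front.
fn check_literal(
    schema: &[(&str, ColumnType)],
    column: &str,
    literal: &Literal,
) -> Result<(), FilterError> {
    let expected = schema
        .iter()
        .find(|(name, _)| *name == column)
        .map(|(_, ty)| *ty)
        .ok_or_else(|| FilterError::UnknownColumn(column.to_string()))?;
    let found = literal.column_type();
    // Exact match, plus one explicit widening: integer literal vs float column.
    let compatible = expected == found
        || (expected == ColumnType::Float64 && found == ColumnType::Int64);
    if compatible {
        Ok(())
    } else {
        Err(FilterError::TypeMismatch { column: column.to_string(), expected, found })
    }
}
```

Because the check returns a `Result`, a filter such as `id == '5'` on an integer column would be rejected with a readable error before any data is touched, rather than quietly matching nothing.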
Some languages have fancy-looking "efficiencies" for iterating through data (comprehensions, list/vector processing, etc.). Does this toolset (or Rust in general) have strong recommendations for iterating over rows? For instance, should I iterate over all conditions for each row, or over all rows for each condition?
I think I can use row-wise operations within the existing batch-wise ops. Is there a more efficient way to go?
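For comparison with the row-wise idea, here is a minimal sketch of the alternative ordering: iterate over rows once per condition, produce one boolean mask per condition, combine the masks, and apply the result to every column of the batch. It uses plain `Vec<Option<i64>>` columns so that it stays self-contained; `mask_ne`, `mask_gt`, `and_masks`, and `filter_column` are hypothetical helpers, not arrow-rs functions. As far as I understand, the arrow crate's compute kernels follow the same general pattern (a comparison yields a boolean array that is then passed to a filter kernel), but that should be checked against its documentation.

```rust
// Mask for `column != rhs`; null cells never match in this sketch.
fn mask_ne(column: &[Option<i64>], rhs: i64) -> Vec<bool> {
    column.iter().map(|cell| matches!(cell, Some(v) if *v != rhs)).collect()
}

// Mask for `column > rhs`; null cells never match in this sketch.
fn mask_gt(column: &[Option<i64>], rhs: i64) -> Vec<bool> {
    column.iter().map(|cell| matches!(cell, Some(v) if *v > rhs)).collect()
}

// Element-wise AND of two masks of equal length.
fn and_masks(a: &[bool], b: &[bool]) -> Vec<bool> {
    a.iter().zip(b).map(|(x, y)| *x && *y).collect()
}

// Keep only the cells whose mask entry is true.
fn filter_column<T: Clone>(column: &[T], mask: &[bool]) -> Vec<T> {
    column
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| value.clone())
        .collect()
}

// `id != 2 and score > 100`, applied to a two-column batch.
fn filter_batch(
    ids: &[Option<i64>],
    scores: &[Option<i64>],
) -> (Vec<Option<i64>>, Vec<Option<i64>>) {
    let mask = and_masks(&mask_ne(ids, 2), &mask_gt(scores, 100));
    (filter_column(ids, &mask), filter_column(scores, &mask))
}
```

Each condition here is a tight loop over a single contiguous column, which is the access pattern columnar formats are designed around. A row-wise loop would instead jump between columns for every row.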
If I'm even partially successful, I'm happy to submit a PR for inclusion here if others find value in this, but it's not a requirement for me (local use only). (Given my lack of experience with Rust, a good review from others would certainly be justified.)