TL;DR: There is an inconsistency in Pinot when a table defines a column as a String but it is populated by reading a Parquet file whose column type is binary without the String annotation.
Introduction:
I'm trying to improve the current Pinot implementation of ClickHouse/ClickBench. In that benchmark the input data can be read from CSV, TSV, JSON, or Parquet files. As usual, Pinot requires the data to be split before ingesting. The current implementation imports the data from a TSV, but it seems easier and faster to import from Parquet. Even better, ClickHouse already provides the data split into 100 Parquet files. But the metadata in those Parquet files is not correct (link to the issue): String columns are marked as binary without the String annotation.
The problem:
When Pinot tries to populate a String column from a Parquet file, it reads the values from the Parquet file, applies some conversions that return an Object, and then applies a stringify conversion to that Object (i.e. calling Object.toString). But the value that is read from Parquet depends on the metadata associated with the Parquet column. In Parquet, String columns are binary columns with an annotation that marks them as a String. Other annotations can be applied to mark a binary column as an enum or a UUID.
Pinot uses these annotations to decide how to read the value, without knowing the type of the Pinot column where the data will be stored. The code that does that is here. If the binary column is not annotated with the String annotation, it is read as a binary (there are some other variants, but they are not important in this discussion). When the value is read as a String, the Java type will be String; when it is read as a binary, the Java type will be byte[]. A sketch of this logic is shown below.
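For illustration, here is a minimal sketch of that kind of extraction logic using the parquet-mr API. This is not Pinot's actual code and the method name is made up; the point is that the Java type of the result depends solely on the Parquet annotation, never on the destination Pinot column type:

```java
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.PrimitiveType;

// Sketch only: decide how to materialize a BINARY value based purely on
// the Parquet logical type annotation.
static Object extractValue(PrimitiveType type, Binary value) {
  LogicalTypeAnnotation annotation = type.getLogicalTypeAnnotation();
  if (annotation instanceof LogicalTypeAnnotation.StringLogicalTypeAnnotation) {
    return value.toStringUsingUTF8(); // annotated BINARY -> java.lang.String
  }
  return value.getBytes();            // un-annotated BINARY -> byte[]
}
```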
Once the value is extracted, Pinot stringifies it before storing it. If the value was a String, this conversion is a no-op. But if the value wasn't marked as String in Parquet, the extracted value is a byte[], and the conversion returns a description of the byte array rather than its contents. I didn't find where the conversion is applied, but the result is that Pinot stores a String that is not the bytes decoded as UTF-8.
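To make the failure mode concrete, this is plain Java behavior, independent of Pinot: calling toString() on a byte[] does not decode the bytes, it returns the JVM's array description:

```java
import java.nio.charset.StandardCharsets;

byte[] bytes = "hello".getBytes(StandardCharsets.UTF_8);

// What a generic stringify step produces: the array's identity string,
// e.g. "[B@1b6d3586" (the suffix varies per run), not the original text.
String stored = bytes.toString();

// What the user presumably expected: the bytes decoded as UTF-8 ("hello").
String expected = new String(bytes, StandardCharsets.UTF_8);
```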
This is problematic because there is no clear indicator that the import is going to change the data, and there is no failure. Users may think that everything is fine, and only after ingesting terabytes of data discover that the stored data is not what they intended to store.
The same problem may also apply to other input formats and other column types, but I didn't try them.
Proposed solutions:
Add checks to the import process that verify that the extracted Java type matches the expected column type. In this case this would produce a fail-fast error that should be clear to the user, who can then make the proper adjustments (like adding the corresponding annotation to the Parquet file). See the sketch below.
Add a property to org.apache.pinot.plugin.inputformat.parquet.ParquetRecordReaderConfig that lets the user customize how columns should be read.
IMO the error should be mandatory and the ability to cast the columns should be optional.
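For the first proposal, something like the following hypothetical check could turn the silent corruption into a fail-fast error. The method and its wiring are illustrative, not an existing Pinot API:

```java
import org.apache.pinot.spi.data.FieldSpec;

// Hypothetical sketch of the proposed fail-fast check: reject an extracted
// byte[] when the destination Pinot column is declared as STRING.
static void checkExtractedType(String column, Object extracted, FieldSpec.DataType declaredType) {
  if (declaredType == FieldSpec.DataType.STRING && extracted instanceof byte[]) {
    throw new IllegalStateException("Column '" + column
        + "' is declared as STRING but the Parquet file provides un-annotated"
        + " BINARY values; annotate the Parquet column as String (or opt in to"
        + " an explicit cast) before ingesting.");
  }
}
```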