Description
Describe the bug
When reading a CSV file that contains double quotation marks, the data comes back wrong in some cases, even when each whole line is read as a single string column.
If the user expects a single string column from a CSV file, it is surprising that double quotation marks still trigger parsing behavior.
The symptoms differ, but they all point to a problem in double-quote handling.
Steps/Code to reproduce bug
Here is the C++ code that reads the CSV. It reads a CSV file with a single-column schema, then prints each line.
I created a test case in cpp/tests/io/csv_test.cpp with the code below to run my test.
// Read a CSV file with a single string column schema
std::string filepath = "test.csv";
cudf::io::csv_reader_options in_opts =
  cudf::io::csv_reader_options::builder(cudf::io::source_info{filepath})
    .header(-1)                            // No header
    .names({"c0"})                         // Single column named c0
    .dtypes({dtype<cudf::string_view>()})  // String type
    .delimiter('\t');                      // Tab delimiter instead of comma
auto result = cudf::io::read_csv(in_opts);
auto const view = result.tbl->view();
std::cout << "Number of columns: " << view.num_columns() << std::endl;
std::cout << "Number of rows: " << view.num_rows() << std::endl;
std::cout << "--- Column Data ---" << std::endl;
for (cudf::size_type col_idx = 0; col_idx < view.num_columns(); ++col_idx) {
  auto const& col = view.column(col_idx);
  std::cout << "Column [" << col_idx << "] "
            << result.metadata.schema_info[col_idx].name << ":" << std::endl;
  if (col.type().id() == type_id::STRING) {
    // For string columns, we can print the data
    // (named `lines` so it does not shadow the outer `result`)
    auto lines = cudf::test::to_strings(col);
    for (size_t line_num = 0; line_num < lines.size(); ++line_num) {
      std::cout << lines[line_num] << std::endl;
    }
  } else {
    std::cout << " (Type " << cudf::type_to_name(col.type())
              << " - data not printed in this test)" << std::endl;
  }
  std::cout << "Column [" << col_idx << "] end" << std::endl;
}
Case 1: An additional "\n" appears at the end of the row when reading a line with an odd number of double quotes
CSV file content:
lt_qeury=o5K","last_ts" end
The wrong output:
lt_qeury=o5K","last_ts" end \n
Case 2: Only one double quote is shown when reading two consecutive double quotes, and the following lines go missing until another double quote shows up.
CSV file content:
"packageName":"test","type":"test","url_scheme":false,"referer":"",test
Below line will be empty
test
test
Until this line with another quote, "test=test" "test"
This line will be shown
The output:
"packageName":"test","type":"test","url_scheme":false,"referer":",test
This line will be shown
Expected behavior
cuDF should output exactly the original content of the CSV file. Pandas outputs the same content as the original CSV files.
Environment overview (please complete the following information)
- Environment location: [Bare-metal]
- Method of cuDF install: [from source]
Environment details
GPU: Titan V
Driver: 575.57.08
CUDA: 12.9
Additional context
The same problem occurs in spark-rapids, since spark-rapids calls the cuDF API to read CSV.
This was reported by one of our customers.