[BUG] Data is wrong when reading CSV with double quotation marks in some cases #20812

@GaryShen2008

Describe the bug
When reading a CSV file that contains double quotation marks, the data comes back wrong in some cases, even when the whole line is read as a single string column.
If the user expects a single string column from a CSV file, it is surprising that any parsing of double quotation marks is applied at all.
The symptoms differ, but they all point to a problem in double quote handling.

Steps/Code to reproduce bug
Here is the C++ code used to read the CSV. It reads a CSV file with a single-column schema and then prints each line.
I added it as a test case in cpp/tests/io/csv_test.cpp to run my test.

// Read a CSV file with a single string column schema
std::string filepath = "test.csv";
cudf::io::csv_reader_options in_opts =
  cudf::io::csv_reader_options::builder(cudf::io::source_info{filepath})
    .header(-1)                            // No header
    .names({"c0"})                         // Single column named c0
    .dtypes({dtype<cudf::string_view>()})  // String type
    .delimiter('\t');                      // Tab delimiter instead of comma
auto result     = cudf::io::read_csv(in_opts);
auto const view = result.tbl->view();

std::cout << "Number of columns: " << view.num_columns() << std::endl;
std::cout << "Number of rows: " << view.num_rows() << std::endl;

std::cout << "--- Column Data ---" << std::endl;
for (cudf::size_type col_idx = 0; col_idx < view.num_columns(); ++col_idx) {
  auto const& col = view.column(col_idx);
  std::cout << "Column [" << col_idx << "] "
            << result.metadata.schema_info[col_idx].name << ":" << std::endl;

  if (col.type().id() == cudf::type_id::STRING) {
    // For string columns, print every row on its own line.
    // Named `lines` so it does not shadow the outer `result`.
    auto const lines = cudf::test::to_strings(col);
    for (std::size_t line_num = 0; line_num < lines.size(); ++line_num) {
      std::cout << lines[line_num] << std::endl;
    }
  } else {
    std::cout << "  (Type " << cudf::type_to_name(col.type())
              << " - data not printed in this test)" << std::endl;
  }
  std::cout << "Column [" << col_idx << "] end" << std::endl;
}

Case 1: An additional "\n" appears at the end of the row when reading a line with an odd number of double quotes
CSV file content:

lt_qeury=o5K","last_ts" end

The wrong output:

lt_qeury=o5K","last_ts" end \n
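To check whether a given input line falls into this case, a small helper can count its double quotes. This is only a minimal sketch in plain C++; `count_double_quotes` and `has_unbalanced_quotes` are hypothetical names and are not part of cuDF.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Count the double-quote characters in one line of text.
std::size_t count_double_quotes(std::string const& line)
{
  return static_cast<std::size_t>(std::count(line.begin(), line.end(), '"'));
}

// True when the line holds an odd number of double quotes, i.e. at least one
// quote has no partner on the same line.
bool has_unbalanced_quotes(std::string const& line)
{
  return count_double_quotes(line) % 2 != 0;
}
```

The Case 1 line contains three double quotes, so `has_unbalanced_quotes` returns true for it.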

Case 2: Only one double quote is shown when reading two consecutive double quotes, and the following lines go missing until another double quote shows up.
CSV file content:

"packageName":"test","type":"test","url_scheme":false,"referer":"",test
Below line will be empty
test
test
Until this line with another quote, "test=test" "test"
This line will be shown

The wrong output:

"packageName":"test","type":"test","url_scheme":false,"referer":",test
This line will be shown
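To make the loss easier to see in a test, the file content and the reader output can be compared line by line. The helper below is a sketch under the assumption that both sides are already available as vectors of strings; `missing_lines` is a hypothetical name, not a cuDF function.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Return every expected line that has no exact match in the actual output.
std::vector<std::string> missing_lines(std::vector<std::string> const& expected,
                                       std::vector<std::string> const& actual)
{
  std::vector<std::string> missing;
  for (auto const& line : expected) {
    if (std::find(actual.begin(), actual.end(), line) == actual.end()) {
      missing.push_back(line);
    }
  }
  return missing;
}
```

For the Case 2 data, five of the six input lines have no exact match in the output: the first line is altered and the next four are absent.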

Expected behavior
cuDF should output exactly the original content of the CSV file. Pandas outputs the same content as the original CSV files.

Environment overview (please complete the following information)

  • Environment location: [Bare-metal]
  • Method of cuDF install: [from source]

Environment details
GPU: Titan V
Driver: 575.57.08
CUDA: 12.9

Additional context
The same problem occurs in spark-rapids, since spark-rapids calls the cuDF API to read CSV.
This was reported by one of our customers.

Labels: bug (Something isn't working), cuIO (cuIO issue)
