Strategies for importing a .csv file that contains commas in one of its columns? #366

Closed
@kburchfiel

I am trying to use the DataFrame library to import a .csv file that contains commas, single quotes, and double quotes within one of its columns.

I imported the file using the following code (here's the full script for reference if needed):

df_Bible.read("../Files/CPDB_for_TTTB.csv", io_format::csv2);

I then used the following code to save this DataFrame to a .csv file:

df_Bible.write<long, double, std::string, std::size_t>(
    "../Files/CPDB_for_TTTB_from_program.csv", io_format::csv2);

When I reviewed the output, I noticed that the entries within the 'Verse' column were getting cut off at their commas. In addition, because these commas were being interpreted as column separators, certain values in columns to the right of the 'Verse' column were getting pushed into other columns.

For instance, here are the original first 5 rows within the 'Verse' column:

In the beginning, God created heaven and earth.
But the earth was empty and unoccupied, and darknesses were over the face of the abyss; and so the Spirit of God was brought over the waters.
And God said, "Let there be light." And light became.
And God saw the light, that it was good; and so he divided the light from the darknesses.
And he called the light, 'Day,' and the darknesses, 'Night.' And it became evening and morning, one day.

And here are the same 5 rows within the 'Verse' column that my script output:

"In the beginning
"But the earth was empty and unoccupied
"And God said
"And God saw the light
"And he called the light
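
This output is consistent with a reader that splits each line on every comma without tracking whether it is inside a quoted field. The following is a minimal sketch of that behavior, offered as an assumption about the cause rather than the library's actual code; `naive_split` and the sample row layout are hypothetical:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not from the DataFrame library): splits a line on
// every comma and pays no attention to double quotes.
std::vector<std::string> naive_split(const std::string &line) {
    std::vector<std::string> fields;
    std::istringstream stream(line);
    std::string field;

    // std::getline with a custom delimiter stops at each ',' it sees,
    // including commas that belong to the text of a quoted field.
    while (std::getline(stream, field, ','))
        fields.push_back(field);
    return fields;
}
```

Given a hypothetical row such as `1,"In the beginning, God created heaven and earth.",Genesis`, this returns four fields instead of three, and the second field is `"In the beginning` with its opening quote still attached, which matches the truncated values above.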

Is it possible to correctly parse .csv files that have commas in certain columns? I can reshape the original file as needed (e.g. by changing the separator from ',' to '\t'), but it would be great to be able to process files that contain commas as part of their column data.

A few additional notes:

  1. This .csv file was originally created via Pandas' to_csv() function. I believe Pandas can read it back in without any trouble, but I'll double-check this.
  2. I believe that, within the Verse column, a comma can be identified as a separator if it is preceded by " or """ (three double quote marks). Double quotes are being used to mark the start and end of this particular string column (but not other string columns, which don't have commas), and double quotes within verses are themselves preceded by double quotes.
  3. Allowing the user to specify an alternative separator like '\t' might be the easiest solution here, but being able to support comma-separated files whose fields also contain commas would be great (yet likely more work).
  4. I think this CSV file complies with RFC 4180, so if the DataFrame library could accommodate files that are consistent with this standard, that would be great!
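
To make the quoting rules in notes 2 and 4 concrete, here is a minimal sketch of a quote-aware field splitter in the spirit of RFC 4180. It is not part of the DataFrame library; `split_csv_record` is a hypothetical name, and the sketch assumes each record sits on a single line (RFC 4180 also allows line breaks inside quoted fields, which this does not handle):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of the DataFrame library.
// Splits one CSV record into fields using RFC 4180-style quoting:
//   - a field may be enclosed in double quotes;
//   - inside a quoted field, the separator is ordinary data;
//   - inside a quoted field, "" stands for one literal double quote.
// Assumption: the whole record is on one line.
std::vector<std::string> split_csv_record(const std::string &line,
                                          char sep = ',') {
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;

    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];

        if (in_quotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    current += '"';     // doubled quote: literal "
                    ++i;                // skip the second quote
                } else {
                    in_quotes = false;  // closing quote of the field
                }
            } else {
                current += c;           // separators here are plain text
            }
        } else if (c == '"') {
            in_quotes = true;           // opening quote of a field
        } else if (c == sep) {
            fields.push_back(current);  // unquoted separator ends the field
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);          // last field has no trailing separator
    return fields;
}
```

With this approach, a hypothetical row like `3,"And God said, ""Let there be light."" And light became.",x` splits into three fields, and the middle field keeps its commas and comes back with single double quotes around `Let there be light.` The `sep` parameter also covers the tab-separated workaround by passing `'\t'`.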
