Spaces at the end of CSV files #290

Open
MBuchalik opened this issue Dec 15, 2020 · 5 comments

@MBuchalik
Collaborator

I tried loading a CSV file (using the Adapter) with a structure like this:

column1,column2,column3
first,second,third




There are a few newlines after the second row.

Expected result

I would expect to only get two rows in the result like so:

[
  ["column1", "column2", "column3"],
  ["first", "second", "third"]
]

Actual result

I am getting multiple "empty" rows at the end:

[
  ["column1","column2","column3"],
  ["first","second","third"],
  [""],
  [""],
  [""]
]

Here is a sample CSV file for you to try it out:
spaces.zip

A similar issue happens if there is an empty line between the rows like here: spaces2.zip. But I am not sure if we should clean that up or not. (Intuitively, I would say "yes")
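
To make the reported behavior concrete, here is a minimal Python sketch. It is not the Adapter's actual code: `naive_parse` is a hypothetical, deliberately naive line splitter without quoting support. It only illustrates how trailing newlines can turn into rows that contain a single empty string.

```python
from typing import List


def naive_parse(text: str) -> List[List[str]]:
    """Hypothetical helper: split CSV text line by line (no quoting support)."""
    # "".split(",") returns [""], so every blank line becomes a one-cell row.
    return [line.split(",") for line in text.splitlines()]


# Same structure as the sample above: two rows followed by three blank lines.
sample = "column1,column2,column3\nfirst,second,third\n\n\n\n"
rows = naive_parse(sample)
print(rows)
```

With this input, the last three entries of `rows` are `[""]`, which matches the shape of the actual result shown above.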

@sonallux
Contributor

As CSV is a very loose standard with many variants, I would argue that removing the blank lines and keeping them are both valid options. Since our CSV parser has an option CsvParser.Feature.SKIP_BLANK_LINES, I would add a new parameter to our CSV format to let the user decide which behavior they want.
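
As a rough illustration of such a parameter, the sketch below uses Python's standard `csv` module rather than the project's parser. The function `read_rows` and its `skip_blank_lines` flag are hypothetical names chosen for this example. Note that `csv.reader` represents a blank line as an empty list, not as `[""]`.

```python
import csv
import io
from typing import List


def read_rows(text: str, skip_blank_lines: bool = False) -> List[List[str]]:
    """Hypothetical sketch: parse CSV text, optionally dropping blank lines."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        # csv.reader yields an empty list for a blank line.
        if skip_blank_lines and not row:
            continue
        rows.append(row)
    return rows


sample = "column1,column2,column3\nfirst,second,third\n\n\n\n"
kept = read_rows(sample)                            # blank lines preserved
skipped = read_rows(sample, skip_blank_lines=True)  # blank lines removed
```

Defaulting the flag to `False` keeps the current behavior, so only users who opt in would see rows disappear.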

@MBuchalik
Collaborator Author

Do you have a sample data set for me where the blank lines actually have a "meaning"? I am just trying to figure out in which cases you would actually need the empty rows.

@sonallux
Contributor

sonallux commented Dec 15, 2020

Do you have a sample data set for me where the blank lines actually have a "meaning"?

Yes, I can think of a case where you take exactly 50 measurements in a defined order. Some measurements might not yield a value (e.g. because the measuring device is defective), and each of these is represented as a blank line. If the blank lines are removed, the information about the missing measurements is lost.
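
A small sketch of this scenario, using four made-up measurements instead of 50 (the data and the helper `parse_measurements` are hypothetical): once the blank line is dropped, the positions of all later values shift.

```python
from typing import List, Optional


def parse_measurements(text: str) -> List[Optional[float]]:
    """Hypothetical helper: one measurement per line, blank line = no value."""
    return [float(line) if line.strip() else None for line in text.splitlines()]


# Made-up data: the second measurement produced no value.
raw = "1.5\n\n2.0\n3.25\n"

kept = parse_measurements(raw)                  # missing value stays visible as None
dropped = [v for v in kept if v is not None]    # what remains after removing blank lines

# In `kept`, index 2 is still the third measurement.
# In `dropped`, index 2 now holds the fourth measurement.
print(kept[2], dropped[2])
```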

@MBuchalik
Collaborator Author

Thanks for the example 👍

We should probably talk about this in the next meeting before implementing something. Reason: We will create a feature for data validation ("column 2 is a number > 5") in the near future. Allowing empty lines could make this feature harder to implement/understand.

Alternatively, we could add a "row number" column. Then, we could always skip the empty lines and users would still be able to figure out that something got skipped. But that could also introduce other issues...
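
One way the "row number" idea could look, as a sketch only: `parse_with_row_numbers` and the `row_number` column name are hypothetical, the splitter ignores quoting, and the input is assumed to start with a header line. Blank lines are skipped, but they still consume a number, so a gap in the numbering shows that something was dropped.

```python
from typing import List


def parse_with_row_numbers(text: str) -> List[List[str]]:
    """Hypothetical sketch: skip blank lines but keep original row numbers."""
    lines = text.splitlines()
    # Assumption: the first line is a header; prepend the extra column name.
    rows = [["row_number"] + lines[0].split(",")]
    for number, line in enumerate(lines[1:], start=1):
        if not line.strip():
            continue  # skipped, but its number is never reused
        rows.append([str(number)] + line.split(","))
    return rows


sample = "column1,column2\na,b\n\nc,d\n"
numbered = parse_with_row_numbers(sample)
print(numbered)
```

Here the data rows carry the numbers 1 and 3, so a consumer can tell that row 2 was empty in the source file.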

@mathiaszinnen
Contributor

The issue has been posted here.
We should try CsvGenerator.Feature.STRICT_CHECK_FOR_QUOTING first.
