-
Notifications
You must be signed in to change notification settings - Fork 642
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Be tolerant to cell formatter problems #969
base: master
Are you sure you want to change the base?
Conversation
Hi @lujop, Thanks for the PR. I noticed recently that the US CDC is using Tablesaw to analyze covid cases, and I would not want them to get bad results because they forgot to check a log :). I think maybe it needs to be an option, with the default behavior to be failing, perhaps with an error message that suggests the "log and continue" option. What do you think? |
Sure @lwhite1. It has a lot of sense to have the option as you said. |
Sounds good. Thanks
…On Thu, Aug 19, 2021 at 12:58 PM Joan Pujol ***@***.***> wrote:
Sure @lwhite1 <https://github.com/lwhite1>. It has a lot of sense to have
the option as you said.
I will do that when I've some time and update the PR.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#969 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA2FPAX7NAGPZUZ5TGSESN3T5UZ2BANCNFSM5CDEYIIA>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&utm_campaign=notification-email>
.
|
@lujop The more i think about this option, the more of an edge case it seems to me. Do you analyze a lot of data where it's ok to throw away rows because they come from a spreadsheet with formatting issues? Or is the problem that cell-formats are so frequently unreliable that it's hard to get many spreadsheets imported? If that's the case, I wonder if using cell-formats didn't make things less reliable. |
I think that neither of the two. My use case it's quite complex and layered, and it's not that I can throw all the incorrect data, most incorrect data can't be processed and some can be with warnings. To give some context to what I do has several layers of error checking/parsing:
I recognize that my use case for sure is not the most orthodox one. But I also think it's not a completely strange one. Because when people use excel I think that we can assume that in most cases that data has been fill by humans, and then not be perfect. I can assure you that they can be very creative in doing small things that completely breaking the parsing. In particular, before improving the type detection logic on excel it was a complete nightmare because a lot of excels couldn't be parsed at all. Returning to your original question about cell formatters, it's not very common to have errors, I checked on last day's logs and I hadn't had any, but I encountered some cases that now I can't exactly remember. But was something with a date format that was not supported by poi if I don't remember bad. I really think that cell formatters and current logic are a lot better than what we had before. And in the worse cases, they don't introduce less parseability, because they are only used when some conversion must be done and before that, there was always an error. Excel doesn't allow to read integer cells as string, or string cells as numbers, you must do the conversions yourself. And the most robust way to do them I think that is by cell formatters. Because for example if you want to see a number as a string (for example some internal code that sometimes can be a number and sometimes not) the way the user wants to have the string for sure is the way it views it in excel and for that the formatters are needed. |
FWIW, I think it should be possible to read with duplicate column names. There is an option called .allowDuplicateColumns() which may not work, or maybe works but was not added to the excel importer. That is something that I think should be fixed. Similarly, with CSVs, I believe the system adds names for unnamed columns (C1 to Cn, I think). Again, I think this is something that is worth doing for Excel if it's possible using POI.
This is a problem with the way people create Excel files. It's not something Tablesaw can fix
I certainly agree, but I think the preferred solution is to fix the error message rather than log and proceed. Is that not possible?
I think many CSVs are created by people converting spreadsheets. I know I do that every time rather than use the excel importer. The problem here is that Excel gives the user the ability to apply a cell format, but the method for doing so (selecting a range and changing it manually) is very error prone. It's easy to add a row after you apply the format for example. The problem more generally is that people are sloppy (or overly fancy in their spreadsheets) in infinite ways. One case I had to fix manually yesterday was they had written comments at the bottom of the spreadsheet. This (and adding calculations at the bottom or the side) is pretty common. So is using cell merging in a way that causes problems. Even Excel cannot export to CSV reliably a spreadsheet with merged cells. I think also that this creates new inconsistencies with the other import methods, which is undesirable. Should all import methods log and continue on errors? Should it be all errors instead of just this one kind? This is also an argument against the approach of specifying a range of cells to import.
I agree with the concept, but I disagree with the approach of continually broadening the constraints on what is imported.
For my education, what happens if I create a spreadsheet of two columns, say country and population, and I don't apply any cell formatting? Does this method work? |
Description
After the changes introduced in 74a54aa where CellFormatters were used to convert types, some excels had format options that make POI throw errors without being able to parse the entire sheet.
In that cases, it's safer to log the error and discard the value that can't be correctly interpreted and converted