Tabitha reporting incorrect page number for rows in XLS file #28

twilco · 2018-10-03T16:36:59Z

Tabitha appears to be assigning the wrong page number to rows in the XLS file found at this URL: https://s3.amazonaws.com/widen-ingester-dev/tabithaDoesn'tSeeFirstPageSanitized.xls

In our application, we only care about rows in the first sheet (page). Our code looks like this:

try (InputStream delimFileStream = amazonS3.getObject(s3BucketPath, s3Key).getObjectContent();
     RowReader reader = RowReaders.open(delimFileStream, fileName).blockingGet()) {

    if (reader == null) {
        // ...stuff...
    }

    Iterable<Row> rows = reader.rows().takeWhile(row -> row.page() == 0).blockingIterable();
    for (Row row : rows) {
        // ...logic to act upon rows in first sheet/page...
    }
}

In the attached spreadsheet, the first sheet contains data, but Tabitha reports every row as being on page 1 (the second sheet). Because the first row already fails the `row.page() == 0` check, `takeWhile` ends the sequence immediately and the loop above iterates zero times, which is not what we expect.
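To illustrate why the loop body never runs, here is a minimal sketch of the `takeWhile` behavior. It uses `java.util.stream` (Java 9+) as a stand-in for the reactive type returned by `reader.rows()`, and plain integers as stand-ins for `row.page()`; the class and method names are hypothetical and not part of Tabitha.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class; integers stand in for the value of row.page().
class TakeWhileVsFilter {

    // Stops at the first element that is not on page 0.
    static List<Integer> takeWhileFirstPage(List<Integer> pages) {
        return pages.stream().takeWhile(p -> p == 0).collect(Collectors.toList());
    }

    // Scans the whole sequence and keeps every element on page 0.
    static List<Integer> filterFirstPage(List<Integer> pages) {
        return pages.stream().filter(p -> p == 0).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Every row reported on page 1, as with the attached XLS file.
        System.out.println(takeWhileFirstPage(Arrays.asList(1, 1, 1))); // prints []
        // Rows reported in the expected order.
        System.out.println(takeWhileFirstPage(Arrays.asList(0, 0, 1))); // prints [0, 0]
        // Rows emitted out of order: takeWhile gives up, filter does not.
        System.out.println(takeWhileFirstPage(Arrays.asList(1, 0, 0))); // prints []
        System.out.println(filterFirstPage(Arrays.asList(1, 0, 0)));    // prints [0, 0]
    }
}
```

Note that swapping `takeWhile` for a filter would only help if rows arrived out of order; it would not help with this file, where the reported page numbers themselves are wrong.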

Something to note: after converting this file to XLSX using Microsoft Excel, Tabitha recognizes the page numbers correctly, so perhaps this is an XLS-only issue.


sagebind · 2018-10-05T16:43:57Z

The given XLS file is laid out very strangely, in a way that confuses the parser. I'd almost call this file corrupt, even.

XLS is laid out as an ordered sequence of binary "records" that indicate various changes and hold data. This particular file has the "begin sheet template" and "begin sheet Values" records before any cell records, which I believe to be incorrect. (That said, XLS has no formal spec, and Excel opens it fine...)

Edit: This file is indeed valid, but laid out in a complex manner that will require Tabitha to do a lot more bookkeeping to parse it correctly. Files formatted in this way could cause rows to be emitted out of order or on a sheet different than the one they are actually located in. This is a bug we want to fix.
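As a rough illustration of how such a layout could produce the reported symptom, here is a toy model. It assumes a parser that derives the page index from a running count of "begin sheet" records; that assumption, the `Record` type, and the method names are all hypothetical, and this is neither Tabitha's actual parser nor the real XLS record layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: a record is either a "begin sheet" marker or a cell.
class NaivePageCounter {

    static final class Record {
        final boolean sheetStart;
        final String label;

        private Record(boolean sheetStart, String label) {
            this.sheetStart = sheetStart;
            this.label = label;
        }

        static Record sheet(String name) { return new Record(true, name); }
        static Record cell(String value) { return new Record(false, value); }
    }

    // Returns the page index a running counter would assign to each cell, in stream order.
    static List<Integer> assignPages(List<Record> stream) {
        List<Integer> pages = new ArrayList<>();
        int currentPage = -1;
        for (Record r : stream) {
            if (r.sheetStart) {
                currentPage++;          // every sheet marker advances the page
            } else {
                pages.add(currentPage); // cells inherit whatever page is current
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Markers interleaved with the cells they own: pages come out as expected.
        System.out.println(assignPages(Arrays.asList(
                Record.sheet("template"), Record.cell("a"),
                Record.sheet("Values"), Record.cell("b")))); // prints [0, 1]

        // Both markers before any cell: every cell lands on page 1.
        System.out.println(assignPages(Arrays.asList(
                Record.sheet("template"), Record.sheet("Values"),
                Record.cell("a"), Record.cell("b")))); // prints [1, 1]
    }
}
```

Under this model, a counter alone cannot tell which sheet a cell belongs to once the markers are separated from the cells, which is the kind of extra bookkeeping the parser would need.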

twilco changed the title ~~Tabitha incorrectly reporting page number for XLS file~~ Tabitha reporting incorrect page number for rows in XLS file Oct 3, 2018

sagebind added the a:excel label Oct 3, 2018

sagebind added this to the 0.5.0 milestone Oct 5, 2018

sagebind self-assigned this Oct 5, 2018

sagebind added the e:hard (Effort: Hard) and bug labels Oct 5, 2018

sagebind removed their assignment Oct 5, 2018

sagebind modified the milestones: 0.5.0, 0.6.0 Oct 11, 2018
