Tabitha reporting incorrect page number for rows in XLS file #28

Open
twilco opened this issue Oct 3, 2018 · 1 comment

@twilco
Contributor

twilco commented Oct 3, 2018

Tabitha appears to be assigning the wrong page number to rows in the XLS file found at this URL: https://s3.amazonaws.com/widen-ingester-dev/tabithaDoesn'tSeeFirstPageSanitized.xls

In our application, we only care about rows in the first sheet (page). Our code looks like this:

try (InputStream delimFileStream = amazonS3.getObject(s3BucketPath, s3Key).getObjectContent();
     RowReader reader = RowReaders.open(delimFileStream, fileName).blockingGet()) {

    if (reader == null) {
        // ...stuff...
    }

    // Take rows only while they are on the first sheet (page 0).
    Iterable<Row> rows = reader.rows().takeWhile(row -> row.page() == 0).blockingIterable();
    for (Row row : rows) {
        // ...logic to act upon rows in first sheet/page...
    }
}

In the attached spreadsheet, there is data in the first sheet, but Tabitha appears to report every row as starting on page 1 (the second sheet). Because takeWhile stops at the first row that fails the predicate, the loop above iterates zero times, which is not what we expect.
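To make the zero-iteration behavior concrete, here is a minimal sketch that uses java.util.stream as a stand-in for the RxJava stream in the snippet above. PagedRow and TakeWhileSketch are hypothetical names for illustration, not Tabitha types. It also contrasts takeWhile with filter, which tolerates out-of-order rows but would not help if every row carries the wrong page number.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for Tabitha's Row; only the page index matters here.
final class PagedRow {
    final int page;
    final String value;

    PagedRow(int page, String value) {
        this.page = page;
        this.value = value;
    }
}

final class TakeWhileSketch {
    // Mirrors the loop above: stops at the first row that is not on page 0.
    static List<String> firstPageWithTakeWhile(List<PagedRow> rows) {
        return rows.stream()
                .takeWhile(row -> row.page == 0)
                .map(row -> row.value)
                .collect(Collectors.toList());
    }

    // Skips rows on other pages instead of stopping, at the cost of reading all rows.
    static List<String> firstPageWithFilter(List<PagedRow> rows) {
        return rows.stream()
                .filter(row -> row.page == 0)
                .map(row -> row.value)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Assumed input: a page-1 row arrives before the page-0 rows.
        List<PagedRow> rows = List.of(
                new PagedRow(1, "a"),
                new PagedRow(0, "b"),
                new PagedRow(0, "c"));

        System.out.println(firstPageWithTakeWhile(rows)); // prints []
        System.out.println(firstPageWithFilter(rows));    // prints [b, c]
    }
}
```

Switching to filter would only be a workaround for ordering problems. If the page number itself is wrong for every row, as it appears to be with this file, both variants return nothing.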

Something to note: after converting this file to XLSX using Microsoft Excel, Tabitha recognizes the page numbers correctly, so perhaps this is an XLS-only issue.

@twilco twilco changed the title Tabitha incorrectly reporting page number for XLS file Tabitha reporting incorrect page number for rows in XLS file Oct 3, 2018
@sagebind sagebind added this to the 0.5.0 milestone Oct 5, 2018
@sagebind sagebind self-assigned this Oct 5, 2018
@sagebind
Member

sagebind commented Oct 5, 2018

The given XLS file is laid out very strangely, in a way that confuses the parser. I'd almost call this file corrupt, even.

XLS is laid out as an ordered sequence of binary "records" that indicate various changes and hold data. This particular file has the "begin sheet template" and "begin sheet Values" records before any cell records, which I believe to be incorrect. (That said, XLS has no formal spec, and Excel opens it fine...)


Edit: This file is indeed valid, but laid out in a complex manner that will require Tabitha to do a lot more bookkeeping to parse it correctly. Files formatted in this way could cause rows to be emitted out of order or on a sheet different than the one they are actually located in. This is a bug we want to fix.
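As a rough illustration of the kind of bookkeeping meant here, the sketch below assigns a page to each cell by matching the sheet section it sits in against the sheet declarations seen earlier, rather than by counting sheet-level records in arrival order. The record kinds, the Rec class, and the offsets are all simplified and hypothetical. This is not Tabitha's parser, and it is not established here that this is the actual cause of the bug.

```java
import java.util.ArrayList;
import java.util.List;

final class SheetBookkeepingSketch {
    // Simplified, hypothetical record kinds; real XLS records carry far more detail.
    enum Kind { SHEET_DECLARED, SHEET_BEGIN, CELL, SHEET_END }

    static final class Rec {
        final Kind kind;
        final int offset; // position of this record in the stream
        final int target; // SHEET_DECLARED only: offset of that sheet's SHEET_BEGIN, else -1

        Rec(Kind kind, int offset, int target) {
            this.kind = kind;
            this.offset = offset;
            this.target = target;
        }
    }

    // Returns the page index of every CELL record, in stream order.
    static List<Integer> pagesOfCells(List<Rec> stream) {
        List<Integer> declaredBegins = new ArrayList<>(); // list index = page number
        List<Integer> pages = new ArrayList<>();
        int currentPage = -1; // -1 means "not inside any declared sheet"

        for (Rec rec : stream) {
            switch (rec.kind) {
                case SHEET_DECLARED:
                    declaredBegins.add(rec.target);
                    break;
                case SHEET_BEGIN:
                    // Look the section up by offset instead of assuming arrival order.
                    currentPage = declaredBegins.indexOf(rec.offset);
                    break;
                case CELL:
                    pages.add(currentPage);
                    break;
                case SHEET_END:
                    currentPage = -1;
                    break;
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Assumed layout: page 0 is declared first but stored second in the stream.
        List<Rec> stream = List.of(
                new Rec(Kind.SHEET_DECLARED, 0, 200),
                new Rec(Kind.SHEET_DECLARED, 10, 100),
                new Rec(Kind.SHEET_BEGIN, 100, -1),
                new Rec(Kind.CELL, 110, -1),
                new Rec(Kind.SHEET_END, 120, -1),
                new Rec(Kind.SHEET_BEGIN, 200, -1),
                new Rec(Kind.CELL, 210, -1),
                new Rec(Kind.CELL, 220, -1),
                new Rec(Kind.SHEET_END, 230, -1));

        System.out.println(pagesOfCells(stream)); // prints [1, 0, 0]
    }
}
```

In this assumed layout the cell on page 1 is read before the cells on page 0, so a reader that emits rows in stream order would also need to buffer or reorder them if callers rely on pages arriving in ascending order.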

@sagebind sagebind added the e:hard (Effort: Hard) and bug labels Oct 5, 2018
@sagebind sagebind removed their assignment Oct 5, 2018
@sagebind sagebind modified the milestones: 0.5.0, 0.6.0 Oct 11, 2018