outline (other metadata?) not preserved using cat #31

Open
jwhendy opened this issue May 1, 2017 · 6 comments
@jwhendy

jwhendy commented May 1, 2017

tl;dr: this got long as I tried to investigate. The short summary is that the cat function appears to strip the handy outline/bookmark index from the document. PyPDF2 appears to support outlines, so this can serve as both 1) a notice that this is happening, in case you weren't aware, and 2) a feature request.

In my opinion, if running cat just to spit out the file, stapler should maintain whatever features already existed. I could also see it being a handy option in general, but merging the TOC/index locations of multiple files might get messy?
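
For the simplest case, whole files catted one after another, the merge mostly comes down to shifting each file's bookmark pages by the number of pages that came before it. A minimal sketch of that idea, using plain tuples rather than real PyPDF2 objects; `merge_outlines` and the `(page_count, bookmarks)` input shape are hypothetical, not anything stapler or PyPDF2 provides:

```python
def merge_outlines(documents):
    """Combine flat bookmark lists from whole documents joined in order.

    documents: list of (page_count, bookmarks) pairs, where bookmarks is a
    list of (title, page_index) with 0-based page indices into that document.
    This input shape is an assumption made for illustration only.
    """
    merged = []
    offset = 0  # number of pages contributed by earlier documents
    for page_count, bookmarks in documents:
        for title, page in bookmarks:
            merged.append((title, page + offset))
        offset += page_count
    return merged


docs = [
    (3, [("A: Intro", 0), ("A: End", 2)]),  # three-page file with two bookmarks
    (2, [("B: Intro", 0)]),                 # two-page file with one bookmark
]
merged = merge_outlines(docs)
```

The messy part is everything this skips: nested entries, duplicate titles, and partial page ranges.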


I just submitted #30, and during my testing I ran stapler cat on the full document and noticed that the file size was different. Opening both, I noticed that, in evince at least, the stapler-generated version had no outline/table of contents, while the original did. Here are both open (evince, Arch Linux), with the original on the left (Outline view in the side pane) and the stapler-generated file on the right, showing that this view is not available. Also of note: the "meta title" is removed from the stapler version.

[screenshot: 2017-05-01_182957]

I thought perhaps PyPDF2 doesn't provide this, but after looking around it appears it might, or at least offers some of the ability:

PdfFileReader

getOutlines(node=None, outlines=None)
Retrieves the document outline present in the document.

Returns: a nested list of Destinations.

PdfFileWriter

addBookmark(title, pagenum, parent=None, color=None, bold=False, italic=False, fit='/Fit', *args)
Add a bookmark to this PDF file.

This seemed straightforward: just translate getOutlines() to addBookmark()? Not so easy... I couldn't find a way to get the page number from the result (though I'm an absolute Python novice, so no surprise there). After some fiddling, I was able to use some example code to manually add a bookmark, and found at least two answers that tackle converting the location ID thingy returned by getOutlines() into a page number.[1] [2]
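
Since getOutlines() is described as returning a nested list, any translation to addBookmark() would first have to walk that nesting. A rough sketch of such a walk, assuming children sit in a sub-list right after their parent entry; plain strings stand in for the Destination objects, and `flatten_outline` is a hypothetical helper, not part of PyPDF2:

```python
def flatten_outline(outline, depth=0):
    """Yield (depth, entry) pairs from a nested outline list.

    Assumption: a sub-list holds the children of the entry just before it,
    so depth 0 is a top-level bookmark, depth 1 its children, and so on.
    """
    for item in outline:
        if isinstance(item, list):
            # descend into the children of the preceding entry
            for pair in flatten_outline(item, depth + 1):
                yield pair
        else:
            yield depth, item


# strings stand in for Destination objects in this illustration
nested = ["Intro", ["Background", "Scope"], "Methods", ["Setup", ["Hardware"]], "Results"]
flat = list(flatten_outline(nested))
```

The depth could then be used to decide which bookmark to pass as the parent when re-adding entries.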

Find attached:

  • test-original.pdf: file generated using Org-mode/LaTeX that I knew would feature a TOC/outline
  • test-stapler.pdf: file produced with stapler cat test.pdf test-cat.pdf
  • test-pypdf2.pdf: file produced from the following code
#!/usr/bin/env python2

from PyPDF2 import PdfFileWriter, PdfFileReader

# code for translating from bookmark to page number
# - http://stackoverflow.com/questions/1918420/split-a-pdf-based-on-outline
# - http://stackoverflow.com/questions/8329748/how-to-get-bookmarks-page-number
def _setup_page_id_to_num(pdf, pages=None, _result=None, _num_pages=None):
    if _result is None:
        _result = {}
    if pages is None:
        _num_pages = []
        pages = pdf.trailer["/Root"].getObject()["/Pages"].getObject()
    t = pages["/Type"]
    if t == "/Pages":
        for page in pages["/Kids"]:
            _result[page.idnum] = len(_num_pages)
            _setup_page_id_to_num(pdf, page.getObject(), _result, _num_pages)
    elif t == "/Page":
        _num_pages.append(1)
    return _result

orig = PdfFileReader(open("./test-original.pdf", "rb"))
outPdf = PdfFileWriter()

# copy the first three pages of the original
for i in range(3):
    outPdf.addPage(orig.getPage(i))

id_to_nums = _setup_page_id_to_num(orig)
outline = orig.getOutlines()

for entry in outline:
    # nested (child) bookmarks come back as sub-lists; only top-level entries are copied here
    if isinstance(entry, list):
        continue
    title = entry["/Title"]
    page = id_to_nums[entry.page.idnum]  # +1 in original code (physical page, not index)
    print(title)
    outPdf.addBookmark(title, page, parent=None)

# open() instead of the Python 2-only file() builtin
with open("./test-pypdf2.pdf", "wb") as outStream:
    outPdf.write(outStream)


@hellerbarde
Owner

Thank you for both of your very thorough issues! I wonder if we can somehow feed the outline back into the FileWriter... The workaround you mentioned with the bookmarks sounds a little tedious and not very satisfying to use.

@jwhendy
Author

jwhendy commented May 15, 2017

@hellerbarde Thanks for taking a look, and happy to submit for a piece of software I think is just so great to have around (even more so when I can show Windows users at work what's possible :) ).

One other issue with trying to pull together the indices: will you, the programmer, ever know which combination of the original indices I want if I'm catting multiple files? Or say there was a bookmark on pg 3 and someone cats 4-n; should you ditch the original bookmark since they extracted the pages after it, or include it since pg 4 is still part of that section?
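
To make the two options concrete, here is a small sketch of both policies on plain data. Everything in it is hypothetical (`remap_bookmarks`, the `carry_forward` flag, the tuple shapes); it only illustrates the choice and is not how stapler or PyPDF2 behave:

```python
def remap_bookmarks(bookmarks, selected_pages, carry_forward=False):
    """Map source bookmarks onto the pages that survive an extraction.

    bookmarks: list of (title, source_page_index), 0-based.
    selected_pages: source page indices in output order.
    carry_forward: if True, and no bookmark points at the first extracted
    page, reuse the nearest earlier bookmark for output page 0.
    """
    # output position of the first occurrence of each source page
    position = {}
    for out_index, src in enumerate(selected_pages):
        position.setdefault(src, out_index)

    result = [(title, position[src]) for title, src in bookmarks if src in position]
    result.sort(key=lambda b: b[1])

    if carry_forward and selected_pages:
        first = selected_pages[0]
        earlier = [b for b in bookmarks if b[1] < first]
        starts_on_bookmark = any(src == first for _, src in bookmarks)
        if earlier and not starts_on_bookmark:
            title, _ = max(earlier, key=lambda b: b[1])
            result.insert(0, (title, 0))
    return result


# bookmark on pg 3 (index 2), extracting pg 4 onward (indices 3..6)
bookmarks = [("Intro", 0), ("Methods", 2), ("Results", 5)]
selected = [3, 4, 5, 6]
strict = remap_bookmarks(bookmarks, selected)
carried = remap_bookmarks(bookmarks, selected, carry_forward=True)
```

Here `strict` ditches the pg 3 bookmark, while `carried` keeps it pointing at the first extracted page; neither is obviously the right default.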

No easy answers, and thanks for taking a look!

@hellerbarde
Owner

I have looked into this now and it seems the underlying library doesn't offer much assistance here unfortunately. I'll have to see if I can do anything about it.

@jwhendy
Author

jwhendy commented Aug 24, 2017

No worries at all, and you can close if you want. I'm guessing this is pretty fringe, as I would only have expected it to work on full docs. As mentioned in the above comment, if someone is merging a bunch of extractions from different files, I don't see a good way to guess what original bookmarks/sections should be included. Thanks for looking, and this is still an amazing tool :)

@hellerbarde
Owner

OK. I'll leave this open as a reminder to look into it again for merging complete documents. But I don't think I'll dive into the nuts and bolts of figuring out how PDF does TOCs... :)

Thanks for the kind words and stuff.

PS: I'm working on a GUI for concatenating files. Psst, mum's the word!

@Frenzie
Contributor

Frenzie commented Aug 29, 2017

Apologies in advance for the off-topic remark, but I figured this could be useful to some. For GUIs there's also a useful little program called pdfshuffler, but I'm not sure how it handles TOCs. I usually turn to pdftk cat for that kind of thing because it deals with it fairly well.
