Parse embedded metadata in PDF files #3108

microtherion · 2024-08-12T19:30:57Z · edited by majora2007

Added

Added: The ability to read embedded metadata from PDFs that follow calibre's embedding conventions.

Implements PDF metadata processing as discussed in #3103. Some of this metadata is generic, and available in many PDFs. Other fields (series and index) are not as standardized, and for these we're using the calibre version of the fields.

This is not only my first Kavita contribution, but also my first time programming in C#, so I hope I managed to write reasonable code, and I would appreciate feedback where I did not.
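For readers unfamiliar with the calibre fields mentioned in the description, here is a minimal sketch of pulling the title, series, and series index out of an XMP packet. It is written in Python for brevity rather than the C# used in this PR, and it is not the code from `PdfMetadataExtractor`. The sample packet, the namespace URIs, and the `read_xmp_fields` name are assumptions about what calibre embeds; check them against a real calibre-written PDF before relying on them.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions about calibre's XMP output, not taken from this PR.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "calibre": "http://calibre-ebook.com/xmp-namespace",
    "calibreSI": "http://calibre-ebook.com/xmp-namespace-series-index",
}

# Hypothetical sample packet, hand-written for illustration.
SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:calibre="http://calibre-ebook.com/xmp-namespace"
        xmlns:calibreSI="http://calibre-ebook.com/xmp-namespace-series-index">
      <dc:title>
        <rdf:Alt><rdf:li xml:lang="x-default">Example Title</rdf:li></rdf:Alt>
      </dc:title>
      <calibre:series rdf:parseType="Resource">
        <rdf:value>Example Series</rdf:value>
        <calibreSI:series_index>2.00</calibreSI:series_index>
      </calibre:series>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_xmp_fields(xmp_text):
    """Return title, series, and series index; missing fields become None."""
    root = ET.fromstring(xmp_text)
    # dc:title is generic XMP; the series fields are the calibre-specific ones.
    title = root.find(".//dc:title/rdf:Alt/rdf:li", NS)
    series = root.find(".//calibre:series/rdf:value", NS)
    index = root.find(".//calibre:series/calibreSI:series_index", NS)
    return {
        "title": title.text if title is not None else None,
        "series": series.text if series is not None else None,
        "series_index": float(index.text) if index is not None else None,
    }


print(read_xmp_fields(SAMPLE_XMP))
```

A PDF without the calibre fields would still yield a title here, with `series` and `series_index` left as `None`.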

majora2007

Preliminary look. I'm mainly concerned with how the implementation fetches the data from the PDF. I would like to investigate grabbing the string via docnet, which we already use for other things in Kavita.

API/Services/BookService.cs

microtherion · 2024-08-12T23:23:30Z

Implemented review feedback. Was not sure about the placement of PdfMetadataExtractor. Is API.Helpers OK?

majora2007

Overall code looks good, minus a few style points. Once the code is in shape, I can spare some time to pull it down and ensure it doesn't break anything.

API.Tests/Services/BookServiceTests.cs

API/Helpers/PdfMetadataExtractor.cs

majora2007 · 2024-08-13T12:07:51Z

It would also be stellar if you could update the wiki alongside this: document the field mapping, as we do for the epub reader, and perhaps add to our calibre guide any settings needed to ensure the metadata is written into the PDF.

microtherion · 2024-08-14T03:37:08Z

Addressed second round of comments, and added documentation in Kareadita/Wiki-Nextra#13

DieselTech · 2024-08-15T17:57:59Z

Did some initial testing of the PR and came across two issues that need to be looked into:

1 - File locking. Once the scanner hits the files, it doesn't seem to let go of the file lock. I couldn't move files around or modify them on the host file system after Kavita scanned them.

https://cdn.discordapp.com/attachments/1273694551739072532/1273694566226333736/image.png?ex=66bf8c00&is=66be3a80&hm=3ef14bddff709d1a7e9f748db3066d070e509ee8419c48240b2ffe25cc9e9a4a&

2 - Long scan times per file. Even on my fairly powerful desktop, scans were taking 5-6 seconds per file. It took almost an hour to bring in the 749 files I have in a test library. This will only get worse on lower-powered hardware, especially if it has to load the entire file to search for metadata. People's TTRPG collections are going to have 300MB+ files, and hundreds of them, if not thousands. That can easily add up to weeks of scanning time.
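As a rough check on the figures in item 2, the per-file time and the file count can be multiplied out. This is only back-of-the-envelope arithmetic in Python; the `scan_time_minutes` helper is hypothetical and assumes every file takes the same time, which real libraries will not.

```python
def scan_time_minutes(file_count, seconds_per_file):
    """Total scan time in minutes, assuming a constant cost per file."""
    return file_count * seconds_per_file / 60


low = scan_time_minutes(749, 5)   # lower bound from the 5 s/file observation
high = scan_time_minutes(749, 6)  # upper bound from the 6 s/file observation
print(f"{low:.0f} to {high:.0f} minutes")  # prints "62 to 75 minutes"
```

That puts the estimate at about an hour, consistent with the scan time reported for the 749-file test library.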

microtherion · 2024-08-15T19:24:21Z

@DieselTech do you see the same behavior re: (1) for epub and pdf?

For (2), were all 749 files PDFs? If so, do you have comparison numbers for epub, so I know what I ought to aim for? It might be an inherently difficult problem, as there may be no way around reading the whole file, but PDF may have a file structure that can be exploited.

DieselTech · 2024-08-15T19:27:59Z

(2) was all PDF files. I didn't test any epubs with this branch, but I can, just to get a comparison.

microtherion · 2024-09-18T10:16:58Z

It turns out that PDF files indeed have an index that allows much faster access to the metadata than the implementation here, though accessing it is a bit trickier.

I'm still working on the new approach.
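The index referred to here is presumably the PDF cross-reference section: a PDF ends with a `startxref` keyword followed by the byte offset of that section, so a reader can seek straight to it instead of scanning the whole file. The following is a minimal Python sketch of that first step, not the approach eventually taken in this PR. The function names, `TAIL_BYTES`, and the chunk size are assumptions, and it only handles a classic `xref` table whose trailer fits in the chunk; cross-reference streams and linearized files need more work.

```python
import re

TAIL_BYTES = 2048  # assumption: the final startxref lies within the last 2 KiB


def find_startxref(f):
    """Return the byte offset recorded after the last 'startxref' keyword."""
    f.seek(0, 2)                       # move to the end to learn the file size
    size = f.tell()
    f.seek(max(0, size - TAIL_BYTES))  # read only the tail, not the whole file
    tail = f.read()
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref found near end of file")
    return int(matches[-1])


def read_trailer_refs(f, chunk=4096):
    """Return the object numbers that the trailer lists for /Info and /Root."""
    f.seek(find_startxref(f))
    data = f.read(chunk)  # classic xref table followed by the trailer dictionary
    refs = {}
    for key in (b"Info", b"Root"):
        # Indirect references look like "/Info 2 0 R" (object number, generation).
        m = re.search(rb"/" + key + rb"\s+(\d+)\s+(\d+)\s+R", data)
        if m:
            refs[key.decode()] = int(m.group(1))
    return refs
```

From there, the `/Info` object holds the generic document information, and the `/Root` catalog can point to an XMP metadata stream; each is located through its entry in the cross-reference table rather than by reading the full file.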

microtherion added 2 commits August 12, 2024 06:10

Parse embedded metadata in PDF files

55e7c52

Add test case (using existing test data), fix one discovered bug

c641b63

majora2007 reviewed Aug 12, 2024

microtherion added 2 commits August 13, 2024 00:23

Address smaller PR comments

2fe65c4

Refactor PDF metadata extraction

c92a39b

microtherion requested a review from majora2007 August 12, 2024 23:23

majora2007 requested changes Aug 13, 2024

Address second round of PR review comments

bebda41

microtherion mentioned this pull request Aug 14, 2024

Document metadata embedded in PDF files Kareadita/Wiki-Nextra#13

Open

microtherion requested a review from majora2007 August 14, 2024 03:37

majora2007 added the enhancement label Aug 14, 2024
