Optimising merged PDFs to reduce file size through duplicated resources #275

@BadgerCode

Description

Hello!
Thank you for providing this library; it has been incredibly helpful!

Problem

I need to merge together many PDFs (potentially 100 or more).
Then, I need to reduce the file size by removing duplicate assets such as fonts and images.
The optimisation also needs to handle content inside embedded forms.


Ideally I would be able to generate a single PDF containing all of the documents at the beginning, but this is not possible in my scenario.


Solution

I have written some code (below) which achieves this.

It's not as efficient as generating a single PDF, and I was wondering whether you have any recommendations for improving it.

I have found many historical forum posts and Stack Overflow questions asking how to do the same, but I was unable to find any solutions or example code.
So this seems like something which would be useful to others.


Summary of how it works:

  • Check each image/font on each page & embedded form
  • Create a hash of the image/font data stream
  • If we've already seen the hash before, re-use the existing PdfReference object for that image/font

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;
using PdfSharp.Pdf.IO;

public class PDFSharpOptimiseMergedPDFsTest
{
    public byte[] CombinePdfs(List<byte[]> pdfs)
    {
        // https://www.pdfsharp.net/wiki-1.5/ConcatenateDocuments-sample.ashx#Source_Code_2
        using var resultPDF = MergePDFsSimple(pdfs);

        // Remove duplicated images and fonts
        var pageNumber = 0;
        foreach (PdfPage page in resultPDF.Pages)
        {
            pageNumber++;
            Console.WriteLine($"Scanning page {pageNumber}");
            OptimiseResources(page);
            Console.WriteLine();
        }

        using var ms = new MemoryStream();
        resultPDF.Save(ms);
        return ms.ToArray();
    }

    private PdfDocument MergePDFsSimple(List<byte[]> pdfs)
    {
        var resultPDF = new PdfDocument();
        foreach (var pdf in pdfs)
        {
            using var src = new MemoryStream(pdf);
            using var srcPDF = PdfReader.Open(src, PdfDocumentOpenMode.Import);
            for (var i = 0; i < srcPDF.PageCount; i++)
            {
                resultPDF.AddPage(srcPDF.Pages[i]);
            }
        }
        return resultPDF;
    }

    private readonly Dictionary<string, PdfReference> ImagesCache = [];
    private readonly Dictionary<string, PdfReference> FontsCache = [];

    private void OptimiseResources(PdfDictionary pdfItem)
    {
        var resources = pdfItem.Elements.GetDictionary("/Resources");
        if (resources == null) return;

        var fonts = resources.Elements.GetDictionary("/Font") ?? new PdfDictionary();
        var xobjects = resources.Elements.GetDictionary("/XObject") ?? new PdfDictionary();

        // Fonts
        foreach (var itemKey in fonts.Elements.Keys)
        {
            if (fonts.Elements[itemKey] is not PdfReference reference) continue;
            if (reference.Value is not PdfDictionary dictionary) continue;

            var fontName = dictionary.Elements.GetName("/BaseFont");
            if (fontName == null) continue;

            var fontFile2 = dictionary.Elements
                .GetArray("/DescendantFonts")?.Elements
                .GetDictionary(0)?.Elements
                .GetDictionary("/FontDescriptor")?.Elements
                .GetDictionary("/FontFile2");

            // TODO: There is also a FontFile3 and other types
            // https://github.com/empira/PDFsharp/blob/master/src/foundation/src/PDFsharp/src/PdfSharp/Pdf.Advanced/PdfFontDescriptor.cs#L328
            if (fontFile2?.Stream == null) continue;

            // Use the stream of the font data, to ensure the font is actually the same
            // For some reason, fonts like ArialUnicodeMS are different on every page
            // Maybe they are font subsets?
            var hash = ByteArrayToString(SHA512.HashData(fontFile2.Stream.Value));

            var useCache = FontsCache.TryGetValue(hash, out var cachedValue);

            Console.WriteLine($"Font {hash[..8]} - {fontName} {(useCache ? "(cached)" : "")}");
            if (!useCache)
            {
                // Add to cache
                FontsCache[hash] = reference;
                continue;
            }

            // Use cached value
            fonts.Elements[itemKey] = cachedValue;
        }

        Console.WriteLine();

        foreach (var itemKey in xobjects.Elements.Keys)
        {
            if (xobjects.Elements[itemKey] is not PdfReference reference) continue;
            if (reference.Value is not PdfDictionary dictionary) continue;

            // Images
            if (dictionary.Elements.GetString("/Subtype") == "/Image")
            {
                var imageLength = dictionary.Elements.GetInteger("/Length");
                if (imageLength == 0) continue;

                var hash = ByteArrayToString(SHA512.HashData(dictionary.Stream.Value));
                var useCache = ImagesCache.TryGetValue(hash, out var cachedValue);

                Console.WriteLine($"Image {hash[..8]} {(useCache ? "(cached)" : "")}");
                if (!useCache)
                {
                    // Add to cache
                    ImagesCache[hash] = reference;
                    continue;
                }

                // Use cached value
                xobjects.Elements[itemKey] = cachedValue;
            }
            // Embedded forms/labels
            else if (dictionary.Elements.GetString("/Subtype") == "/Form")
            {
                Console.WriteLine($"Scanning embedded Form");
                OptimiseResources(dictionary);
            }
        }
    }

    // Consider using a custom IEqualityComparer<byte[]> instead of hashing every resource.
    // Equals must be 100% accurate, but GetHashCode may sometimes generate collisions for different items.
    private static string ByteArrayToString(byte[] array)
    {
        // Convert.ToHexString (.NET 5+) avoids the quadratic cost of
        // repeated string concatenation in a loop
        return Convert.ToHexString(array);
    }
}
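
The `IEqualityComparer<byte[]>` idea from the comment above could look something like the sketch below. The class name and the FNV-1a constants are my own choices, not part of the code above; the point is that `Equals` compares the raw stream bytes exactly, while `GetHashCode` only needs to be cheap and consistent.

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch: lets a Dictionary<byte[], PdfReference> key directly on
// the resource's stream bytes instead of a hex-encoded SHA-512 string.
public sealed class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[]? x, byte[]? y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        // Exact byte-for-byte comparison; this is the final arbiter
        return x.AsSpan().SequenceEqual(y);
    }

    public int GetHashCode(byte[] obj)
    {
        // FNV-1a over (at most) the first 64 bytes; occasional collisions
        // are fine because Equals resolves them
        unchecked
        {
            uint hash = 2166136261;
            int limit = Math.Min(obj.Length, 64);
            for (int i = 0; i < limit; i++)
                hash = (hash ^ obj[i]) * 16777619;
            return (int)hash;
        }
    }
}
```

Usage would be `new Dictionary<byte[], PdfReference>(new ByteArrayComparer())` for the two caches. This trades the up-front cost of hashing every resource for a full byte comparison on hash-bucket collisions, which is usually cheaper when most resources are unique.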

Tests

In my test scenario, I have three PDFs; some of the images and fonts appear in each PDF.

  • File size if all 3 documents are generated as a single PDF
    • 81,711 bytes
  • Total size of individual files
    • 221,695 bytes (2.71x)
  • Size when merged using PDFSharp
    • 221,225 bytes (2.71x)
  • Size when merged using PDFSharp & using the optimisation above
    • 123,345 bytes (1.51x)

Further problems

  • Some fonts (e.g. ArialUnicodeMS) seem to be different on each PDF
    • This might be because of subsets?
    • Is there a way to combine these?
  • Is there a simpler way to compare embedded fonts?
    • I can see there are multiple properties that can contain the font stream: FontFile, FontFile2 and FontFile3.
    • Should I just check all three?
  • Are there other resources I should optimise?
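
For the FontFile question, one option is a small helper that probes all three keys on the font descriptor. This is a sketch using the same `Elements` API as the code above; the helper name is mine. Per the PDF specification, at most one of `/FontFile`, `/FontFile2` and `/FontFile3` may be present in a font descriptor, so the first stream found is the embedded font program.

```csharp
using PdfSharp.Pdf;

internal static class FontFileProbe
{
    // Hypothetical helper: return whichever embedded font stream is present.
    // /FontFile (Type 1), /FontFile2 (TrueType) and /FontFile3 (CFF and
    // others) are mutually exclusive in a font descriptor.
    public static PdfDictionary? GetFontFileStream(PdfDictionary fontDescriptor)
    {
        foreach (var key in new[] { "/FontFile", "/FontFile2", "/FontFile3" })
        {
            var fontFile = fontDescriptor.Elements.GetDictionary(key);
            if (fontFile?.Stream != null)
                return fontFile;
        }
        return null;
    }
}
```

The hash-and-cache logic would then use whichever stream this returns, instead of hard-coding `/FontFile2`.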

Labels

enhancement (New feature or request)