Optimising merged PDFs to reduce file size through duplicated resources #275

@BadgerCode

Description

Hello!
Thank you for providing this library; it has been incredibly helpful!

Problem

I need to merge together many PDFs (potentially 100 or more).
Then, I need to reduce the file size by removing duplicate assets such as fonts and images.
The optimisation also needs to handle content inside embedded forms.


Ideally I would be able to generate a single PDF containing all of the documents at the beginning, but this is not possible in my scenario.


Solution

I have written some code (below) which achieves this.

It's not as efficient as generating a single PDF, and I was wondering whether you have any recommendations for improving it.

I have found many historical forum posts and Stack Overflow questions asking how to do the same, but I was unable to find any solutions or example code.
So this seems like something which would be useful to others.


Summary of how it works:

  • Check each image/font on each page & embedded form
  • Create a hash of the image/font data stream
  • If we've already seen the hash before, re-use the existing PdfReference object for that image/font

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using PdfSharp.Pdf;
using PdfSharp.Pdf.Advanced;
using PdfSharp.Pdf.IO;

public class PDFSharpOptimiseMergedPDFsTest
{
    public byte[] CombinePdfs(List<byte[]> pdfs)
    {
        // https://www.pdfsharp.net/wiki-1.5/ConcatenateDocuments-sample.ashx#Source_Code_2
        using var resultPDF = MergePDFsSimple(pdfs);

        // Remove duplicated images and fonts
        var pageNumber = 0;
        foreach (PdfPage page in resultPDF.Pages)
        {
            pageNumber++;
            Console.WriteLine($"Scanning page {pageNumber}");
            OptimiseResources(page);
            Console.WriteLine();
        }

        using var ms = new MemoryStream();
        resultPDF.Save(ms);
        return ms.ToArray();
    }

    private PdfDocument MergePDFsSimple(List<byte[]> pdfs)
    {
        var resultPDF = new PdfDocument();
        foreach (var pdf in pdfs)
        {
            using var src = new MemoryStream(pdf);
            using var srcPDF = PdfReader.Open(src, PdfDocumentOpenMode.Import);
            for (var i = 0; i < srcPDF.PageCount; i++)
            {
                resultPDF.AddPage(srcPDF.Pages[i]);
            }
        }
        return resultPDF;
    }

    private readonly Dictionary<string, PdfReference> ImagesCache = [];
    private readonly Dictionary<string, PdfReference> FontsCache = [];

    private void OptimiseResources(PdfDictionary pdfItem)
    {
        var resources = pdfItem.Elements.GetDictionary("/Resources");
        if (resources == null) return;

        var fonts = resources.Elements.GetDictionary("/Font") ?? new PdfDictionary();
        var xobjects = resources.Elements.GetDictionary("/XObject") ?? new PdfDictionary();

        // Fonts
        foreach (var itemKey in fonts.Elements.Keys)
        {
            if (fonts.Elements[itemKey] is not PdfReference reference) continue;
            if (reference.Value is not PdfDictionary dictionary) continue;

            var fontName = dictionary.Elements.GetName("/BaseFont");
            if (fontName == null) continue;

            var fontFile2 = dictionary.Elements
                .GetArray("/DescendantFonts")?.Elements
                .GetDictionary(0)?.Elements
                .GetDictionary("/FontDescriptor")?.Elements
                .GetDictionary("/FontFile2");

            // TODO: There is also a FontFile3 and other types
            // https://github.com/empira/PDFsharp/blob/master/src/foundation/src/PDFsharp/src/PdfSharp/Pdf.Advanced/PdfFontDescriptor.cs#L328
            if (fontFile2?.Stream == null) continue;

            // Use the stream of the font data, to ensure the font is actually the same
            // For some reason, fonts like ArialUnicodeMS are different on every page
            // Maybe they are font subsets?
            var hash = ByteArrayToString(SHA512.HashData(fontFile2.Stream.Value));

            var useCache = FontsCache.TryGetValue(hash, out var cachedValue);

            Console.WriteLine($"Font {hash[..8]} - {fontName} {(useCache ? "(cached)" : "")}");
            if (!useCache)
            {
                // Add to cache
                FontsCache[hash] = reference;
                continue;
            }

            // Use cached value
            fonts.Elements[itemKey] = cachedValue;
        }

        Console.WriteLine();

        foreach (var itemKey in xobjects.Elements.Keys)
        {
            if (xobjects.Elements[itemKey] is not PdfReference reference) continue;
            if (reference.Value is not PdfDictionary dictionary) continue;

            // Images
            if (dictionary.Elements.GetString("/Subtype") == "/Image")
            {
                var imageLength = dictionary.Elements.GetInteger("/Length");
                if (imageLength == 0) continue;

                var hash = ByteArrayToString(SHA512.HashData(dictionary.Stream.Value));
                var useCache = ImagesCache.TryGetValue(hash, out var cachedValue);

                Console.WriteLine($"Image {hash[..8]} {(useCache ? "(cached)" : "")}");
                if (!useCache)
                {
                    // Add to cache
                    ImagesCache[hash] = reference;
                    continue;
                }

                // Use cached value
                xobjects.Elements[itemKey] = cachedValue;
            }
            // Embedded forms/labels
            else if (dictionary.Elements.GetString("/Subtype") == "/Form")
            {
                Console.WriteLine($"Scanning embedded Form");
                OptimiseResources(dictionary);
            }
        }
    }

    // Consider using a custom IEqualityComparer<byte[]> instead of hashing every resource.
    // Equals must be 100% accurate, but GetHashCode may sometimes generate collisions for different items.
    private static string ByteArrayToString(byte[] array)
    {
        // Convert.ToHexString (.NET 5+) avoids the quadratic cost of
        // repeated string concatenation in a loop
        return Convert.ToHexString(array);
    }
}
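
The `IEqualityComparer<byte[]>` idea from the comment above could look something like the sketch below. The class name and the FNV-1a constants are my own choices, not part of the code above; the point is that `Equals` compares the raw stream bytes exactly, while `GetHashCode` only needs to be cheap and consistent.

```csharp
using System;
using System.Collections.Generic;

// Minimal sketch: lets a Dictionary<byte[], PdfReference> key directly on
// the resource's stream bytes instead of a hex-encoded SHA-512 string.
public sealed class ByteArrayComparer : IEqualityComparer<byte[]>
{
    public bool Equals(byte[]? x, byte[]? y)
    {
        if (ReferenceEquals(x, y)) return true;
        if (x == null || y == null) return false;
        // Exact byte-for-byte comparison; this is the final arbiter
        return x.AsSpan().SequenceEqual(y);
    }

    public int GetHashCode(byte[] obj)
    {
        // FNV-1a over (at most) the first 64 bytes; occasional collisions
        // are fine because Equals resolves them
        unchecked
        {
            uint hash = 2166136261;
            int limit = Math.Min(obj.Length, 64);
            for (int i = 0; i < limit; i++)
                hash = (hash ^ obj[i]) * 16777619;
            return (int)hash;
        }
    }
}
```

Usage would be `new Dictionary<byte[], PdfReference>(new ByteArrayComparer())` for the two caches. This trades the up-front cost of hashing every resource for a full byte comparison on hash-bucket collisions, which is usually cheaper when most resources are unique.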

Tests

In my test scenario, I have three PDFs; some of the images and fonts appear in each PDF.

  • File size if all 3 documents are generated as a single PDF
    • 81,711 bytes
  • Total size of individual files
    • 221,695 bytes (2.71x)
  • Size when merged using PDFSharp
    • 221,225 bytes (2.71x)
  • Size when merged using PDFSharp & using the optimisation above
    • 123,345 bytes (1.51x)

Further problems

  • Some fonts (e.g. ArialUnicodeMS) seem to be different on each PDF
    • This might be because of subsets?
    • Is there a way to combine these?
  • Is there a simpler way to compare embedded fonts?
    • I can see there are multiple properties that can contain the font stream: FontFile, FontFile2 and FontFile3.
    • Should I just check all three?
  • Are there other resources I should optimise?
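
For the FontFile question, one option is a small helper that probes all three keys on the font descriptor. This is a sketch using the same `Elements` API as the code above; the helper name is mine. Per the PDF specification, at most one of `/FontFile`, `/FontFile2` and `/FontFile3` may be present in a font descriptor, so the first stream found is the embedded font program.

```csharp
using PdfSharp.Pdf;

internal static class FontFileProbe
{
    // Hypothetical helper: return whichever embedded font stream is present.
    // /FontFile (Type 1), /FontFile2 (TrueType) and /FontFile3 (CFF and
    // others) are mutually exclusive in a font descriptor.
    public static PdfDictionary? GetFontFileStream(PdfDictionary fontDescriptor)
    {
        foreach (var key in new[] { "/FontFile", "/FontFile2", "/FontFile3" })
        {
            var fontFile = fontDescriptor.Elements.GetDictionary(key);
            if (fontFile?.Stream != null)
                return fontFile;
        }
        return null;
    }
}
```

The hash-and-cache logic would then use whichever stream this returns, instead of hard-coding `/FontFile2`.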

Labels

enhancement (New feature or request)