Document Processing Pipelines

Where to start? How about ImportJob: our progress-reporting mechanism. No matter how you upload files, Overview can always tell you:

  • The DocumentSet the files will go into. (Overview always creates a document set first and adds files to it second.)
  • Progress-reporting information: a way to set the user's expectations.
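
For orientation, here is a minimal sketch of the shape such a progress report might take. It is illustrative Scala, not Overview's actual ImportJob API; the field names are assumptions.

    // Illustrative only: field names approximate the two points above,
    // not Overview's real ImportJob.
    case class ImportJobSketch(
      documentSetId: Long,      // the DocumentSet the uploaded files will go into
      progress: Option[Double], // fraction complete, 0.0 to 1.0, or None if unknown
      description: String       // text that sets the user's expectations
    )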

Beyond that, our import pipelines have a bit in common:

  • Every pipeline creates Document objects.
  • Documents are always generated in Overview's "worker" process (as opposed to its "web server" process).

What We're Generating

Each import pipeline creates Documents. Document data is stored in a few places:

  • Most document data is in the Postgres database, in the document table. In particular, document text and title (which Overview generates within these pipelines) and document notes and metadata (which the user provides) are stored here.
  • Tags are in the tag table, and document-tag lists are in the document_tag table.
  • Processed uploaded files are Files, with metadata in the file table and file contents in BlobStorage (Amazon S3 or the filesystem). Alongside each uploaded file is a generated PDF that Overview lets the user view.
  • When the user chooses to split by page, Overview generates a PDF per page for the user to view: that's in the page table and in BlobStorage.
  • Thumbnails are in BlobStorage.
  • Each document set also has a Lucene index containing document titles, text and metadata. The worker maintains those indexes on the filesystem.
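
As a hedged sketch, the pieces above can be pictured as one record per document. The field names below are approximations of the tables just listed, not Overview's exact schema; tags are omitted because they live in their own tag and document_tag tables.

    // Approximate model of where a single document's data lives, per the
    // list above. Field names are illustrative, not the exact schema.
    case class DocumentSketch(
      id: Long,
      documentSetId: Long,
      title: String,                     // generated by the import pipeline
      text: String,                      // extracted text; also indexed in Lucene
      metadataJson: String,              // user-supplied metadata
      fileId: Option[Long],              // row in the file table; contents in BlobStorage
      pageId: Option[Long],              // set when the user chose to split by page
      thumbnailLocation: Option[String]  // BlobStorage location of the thumbnail
    )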

Pipelines

File Upload Pipeline

  1. User uploads files into GroupedFileUploads (and Postgres Large Objects).

    1. On demand, the server creates a FileGroup to hold all the files the user will upload. (There is one FileGroup per User+DocumentSet, and DocumentSet may be null here.)
    2. The user streams each file into a GroupedFileUpload, assigning a client-generated GUID to handle resuming. See js-mass-upload for design details.
    3. The user clicks "Finish". Overview creates the DocumentSet if it doesn't exist yet, then sets FileGroup.addToDocumentSetId and kicks off the worker process.
  2. Worker creates documents (sketched in code after this list). For each uploaded file:

    1. Worker generates a "view" PDF (for the user to see) out of the raw uploaded file:
      • For a non-PDF file (e.g., .doc), we convert to PDF by shelling out to LibreOffice
      • For a PDF file, we use PdfOcr.makeSearchablePdf
    2. Worker copies the raw uploaded file and the "view" PDF to BlobStorage.
    3. Worker reads the PDF text, then creates and indexes a Document. If the user chose to split by page, worker instead generates a document per page and writes each page's "view" PDF to BlobStorage. Worker also generates thumbnails in this step.
    4. Worker deletes the GroupedFileUpload and its associated Postgres Large Object. (Now, if Overview restarts, it won't re-process this file.) Worker can have many actors, each processing a GroupedFileUpload. Worker aggregates progress reports from these simultaneous sources.
  3. Worker sorts the documents and writes the result to document_set.sorted_document_ids.

  4. Worker deletes the FileGroup.
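
Here is the sketch referenced in step 2: a compressed, single-threaded view of the per-file work. Every type and helper in it is a stand-in named after the steps above (the real worker is actor-based and processes many GroupedFileUploads at once), so treat it as a reading aid rather than Overview's code.

    // Sketch of step 2's per-file loop. All names are stand-ins for the
    // components described above, not Overview's real classes.
    object FileUploadPipelineSketch {
      case class Upload(id: Long, isPdf: Boolean, bytes: Array[Byte])

      // Stubs for the real components; bodies elided.
      def makeSearchablePdf(bytes: Array[Byte]): Array[Byte] = bytes       // step 2.1, PDF input
      def convertWithLibreOffice(bytes: Array[Byte]): Array[Byte] = bytes  // step 2.1, non-PDF input
      def blobStoragePut(location: String, bytes: Array[Byte]): Unit = ()  // step 2.2
      def splitPdfIntoPages(pdf: Array[Byte]): Seq[(Array[Byte], String)] = Seq.empty
      def extractText(pdf: Array[Byte]): String = ""
      def createAndIndexDocument(text: String): Unit = ()                  // step 2.3
      def writeThumbnail(pdf: Array[Byte]): Unit = ()
      def deleteUpload(upload: Upload): Unit = ()                          // step 2.4

      def processUpload(upload: Upload, splitByPage: Boolean): Unit = {
        // 2.1: produce the "view" PDF the user will read
        val viewPdf =
          if (upload.isPdf) makeSearchablePdf(upload.bytes)
          else convertWithLibreOffice(upload.bytes)

        // 2.2: persist the raw upload and the view PDF
        blobStoragePut(s"file/${upload.id}/raw", upload.bytes)
        blobStoragePut(s"file/${upload.id}/view", viewPdf)

        // 2.3: create and index one Document, or one per page when splitting
        if (splitByPage) {
          splitPdfIntoPages(viewPdf).foreach { case (pagePdf, pageText) =>
            blobStoragePut(s"page/${upload.id}", pagePdf)
            createAndIndexDocument(pageText)
          }
        } else {
          createAndIndexDocument(extractText(viewPdf))
        }
        writeThumbnail(viewPdf)

        // 2.4: delete the GroupedFileUpload so a restart won't re-process it
        deleteUpload(upload)
      }
    }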

When the user asks for a progress report, the web server builds an ImportJob from the file_group table.
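
A sketch of that mapping, assuming hypothetical progress counters (nFiles, nFilesProcessed) rather than the real file_group columns, and reusing the ImportJobSketch from the top of this page:

    // Sketch only: the counters here are assumptions about what a
    // file_group row tracks, not the actual columns.
    object ProgressReportSketch {
      case class FileGroupRow(addToDocumentSetId: Option[Long], nFiles: Int, nFilesProcessed: Int)

      def toImportJob(row: FileGroupRow): Option[ImportJobSketch] =
        row.addToDocumentSetId.map { documentSetId =>
          ImportJobSketch(
            documentSetId = documentSetId,
            progress = Some(if (row.nFiles == 0) 1.0 else row.nFilesProcessed.toDouble / row.nFiles),
            description = s"Processed ${row.nFilesProcessed} of ${row.nFiles} files"
          )
        }
    }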

DocumentCloud Pipeline

TODO

CSV-Import Pipeline

TODO
