overview/overview-convert-framework

Base image for Overview converters.

How a Converter Works

A converter's job is to turn files of one type into files of another type. It does this in a loop. It receives jobs from an internal Overview HTTP server.

This base image provides portable executables that communicate with Overview. They make up a framework: they'll call your converter program, which you can write in any language.

Your converter will have a Dockerfile that looks like this:

FROM overview/overview-convert-framework AS framework
# multi-stage build

FROM alpine:3.7 AS build
# ... (build your executables, including `do-convert-single-file`)

FROM alpine:3.7 AS production
# Add ca-certificates to let container download from S3 https:// URLs
RUN apk add --update --no-cache ca-certificates
WORKDIR /app
# The framework provides the main executable
COPY --from=framework /app/run /app/run
# Your `do-convert` code can choose from a few different input and output
# formats. The framework provides many `/app/convert` implementations: pick
# the one that matches your `do-convert`.
COPY --from=framework /app/convert-single-file /app/convert
COPY --from=build /app/do-convert-single-file /app/do-convert-single-file

/app/run

This framework runs on a loop:

  1. Download a task from Overview as JSON.
  2. Open a stream to download the body of the input file.
  3. Stream the body to /app/convert MIME-BOUNDARY JSON and pipe the results to Overview.

/app/run handles all communication with Overview. In particular:

  • /app/run polls for tasks at POLL_URL. Overview's administrator must set POLL_URL for your container.
  • /app/run will retry if there is a connection error.
  • /app/run will never crash.
  • TODO /app/run will poll Overview to check if the task is canceled. It will notify /app/convert with SIGINT if the task is canceled.
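
The loop and its retry behavior can be pictured with a small Python sketch. This is not the framework's actual implementation: `run_loop`, `fetch_task`, `convert_task`, `max_tasks` and `retry_delay` are hypothetical names, and the HTTP details of talking to POLL_URL are deliberately left to the caller.

```python
import time


def run_loop(fetch_task, convert_task, max_tasks=None, retry_delay=1.0,
             sleep=time.sleep):
    """Poll for tasks, retrying on connection errors.

    fetch_task() returns a task or None when no work is waiting;
    convert_task(task) streams the input through /app/convert.
    Both are hypothetical callables supplied by the caller.
    """
    handled = 0
    while max_tasks is None or handled < max_tasks:
        try:
            task = fetch_task()
            if task is None:
                # Nothing to do yet: wait, then poll again.
                sleep(retry_delay)
                continue
            convert_task(task)
            handled += 1
        except ConnectionError:
            # Overview is unreachable: wait, then poll again.
            sleep(retry_delay)
    return handled
```

Leaving `max_tasks` as None makes the loop run forever, which matches the behavior described above; a finite value is only there to make the sketch easy to exercise.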

/app/convert -- a.k.a., /app/convert-*

/app/convert is a program we provide, under a few different names. That is, when you create your program you'll choose one of the following implementations to copy into /app/convert in your image.

From /app/run's point of view, /app/convert reads the input stream from standard input, takes MIME-BOUNDARY and JSON as command-line arguments (in C lingo, argv[1] and argv[2]), and produces a multipart/form-data output stream that uses MIME-BOUNDARY as its boundary. /app/convert will never crash, and it will always output a data stream that Overview can handle.

Your code is invoked by /app/convert, following one of these strategies:

/app/convert-single-file

This version of /app/convert will:

  1. Write standard input to input.blob in a temporary directory and verify it's the correct size
  2. Run /app/do-convert-single-file JSON (your code) in the temporary directory
  3. Translate the stdout from your code into progress events or an error event
  4. When your code exits with status 0 and no error message, pipe 0.json, 0.blob -- and if they exist, 0-thumbnail.jpg, 0-thumbnail.png and 0.txt -- and a done event

Special cases:

  • Cancelation: if /app/run sends a SIGINT signal, /app/convert forwards SIGINT to your program. Your program should kill and wait for any child processes, then exit. Its standard output and standard error will be ignored.
  • Error: if /app/do-convert-single-file exits with a non-zero status, /app/convert pipes an error event.
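
Step 3 above amounts to classifying each line your program prints, using the line formats listed in the next paragraph. One way to picture it in Python is sketched below. The function name `classify_line` and the tuple shapes it returns are illustrative assumptions, not the framework's internals.

```python
import re

# Line formats taken from the progress-message list in this README.
_PAGES = re.compile(r"p(\d+)/(\d+)")
_BYTES = re.compile(r"b(\d+)/(\d+)")
_FRACTION = re.compile(r"\d+(?:\.\d+)?")


def classify_line(line):
    """Map one stdout line to a (kind, ...) tuple (hypothetical shape)."""
    text = line.rstrip("\n")
    match = _PAGES.fullmatch(text)
    if match:
        return ("pages", int(match.group(1)), int(match.group(2)))
    match = _BYTES.fullmatch(text)
    if match:
        return ("bytes", int(match.group(1)), int(match.group(2)))
    if _FRACTION.fullmatch(text):
        fraction = float(text)
        if 0.0 <= fraction <= 1.0:
            return ("fraction", fraction)
    # Anything else at all is treated as an error message.
    return ("error", text)
```

The practical consequence is that stray debug output on stdout becomes an error, so send diagnostics to stderr instead.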

You must provide /app/do-convert-single-file. The framework will invoke /app/do-convert-single-file JSON. Your program can read input.blob in the current working directory. Your program must:

  1. Write progress messages to stdout, newline-delimited, that look like:
    • p1/2 -- "finished processing page 1 of 2"
    • b102/412 -- "finished processing byte 102 of 412"
    • 0.324 -- "finished processing 32.4% of input"
    • anything else at all -- "ERROR: [the line of text]"
  2. Write 0.json, 0.blob, and optionally 0-thumbnail.jpg, 0-thumbnail.png and/or 0.txt.
  3. Exit with status code 0. Any other exit code is an error in your code.
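
As a rough illustration, a Python do-convert-single-file might look like the sketch below. It copies input.blob through unchanged, reports byte progress and echoes the task JSON as metadata. The pass-through "conversion", the `CHUNK_SIZE` constant, the contents of 0.json and the 0.* output names (borrowed from the test fixtures described in the next section) are assumptions made for illustration only.

```python
#!/usr/bin/env python3
import json
import sys
from pathlib import Path

CHUNK_SIZE = 64 * 1024  # arbitrary read size for this sketch


def convert(task, workdir=Path("."), out=sys.stdout):
    """Copy input.blob to 0.blob unchanged, reporting byte progress."""
    source = workdir / "input.blob"
    total = source.stat().st_size
    done = 0
    with source.open("rb") as src, (workdir / "0.blob").open("wb") as dst:
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            dst.write(chunk)
            done += len(chunk)
            # "bN/M" means "finished processing byte N of M".
            print(f"b{done}/{total}", file=out, flush=True)
    # Placeholder metadata: the real 0.json schema is defined by Overview.
    (workdir / "0.json").write_text(json.dumps(task))


# The framework passes the task JSON as the only argument (argv[1]).
if __name__ == "__main__" and len(sys.argv) == 2:
    convert(json.loads(sys.argv[1]))
```

An uncaught exception makes Python exit with a non-zero status, which the framework reports as an error; a normal return exits with status 0.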

Testing: /app/test-convert-single-file

You can test /app/do-convert-single-file by creating a Docker image with the special framework program, /app/test-convert-single-file. This is designed to integrate with automated build environments like Docker Hub.

Your Docker build stage doesn't need a CMD. It should include:

  • /app/test-convert-single-file -- and you should RUN [ "/app/test-convert-single-file" ]
  • /app/do-convert-single-file and everything it depends on -- /app/test-convert-single-file will invoke it once per test
  • /app/test/test-*: one directory per test, e.g. /app/test/test-with-ocr. Each test directory should contain:
    • input.blob
    • input.json -- the JSON passed to do-convert-single-file
    • stdout -- expected standard output from do-convert-single-file
    • 0.blob -- expected 0.blob output
    • 0.json -- expected 0.json output
    • 0.txt (optional) -- expected 0.txt output
    • 0-thumbnail.{png,jpg} (optional) -- expected output

test-convert-single-file will run do-convert-single-file in a separate directory per test. It will output in TAP format and exit with status code 1 if any test fails.
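
Before building the image, it can help to check that every test directory has the required fixture files. The helper below is a hypothetical pre-flight check in Python, not part of the framework; the names `REQUIRED`, `missing_fixture_files` and `check_suite` are made up for this sketch, and the optional files (0.txt and the thumbnails) are intentionally not checked.

```python
from pathlib import Path

# Required fixture files, per the list above.
REQUIRED = ("input.blob", "input.json", "stdout", "0.blob", "0.json")


def missing_fixture_files(test_dir):
    """Return the required fixture names absent from one test directory."""
    root = Path(test_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]


def check_suite(suite_dir):
    """Map each test-* directory name to its missing files, if any."""
    problems = {}
    for test_dir in sorted(Path(suite_dir).glob("test-*")):
        if test_dir.is_dir():
            missing = missing_fixture_files(test_dir)
            if missing:
                problems[test_dir.name] = missing
    return problems
```

An empty dictionary from `check_suite("test")` (assuming your fixtures live in a local `test` directory) means every test directory has the five required files.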

Copying failed-test files from the test suite

The test output is designed to help you correct your tests. For instance, here is example output from a test that fails because do-convert-single-file wrote 0-thumbnail.jpg but the test directory contains no expected 0-thumbnail.jpg:

Step 12/13 : RUN [ "/app/test-convert-single-file" ]
 ---> Running in f65521f3a30c
1..3
Tesseract Open Source OCR Engine v3.04.01 with Leptonica
not ok 1 - test-jpg-ocr
    do-convert-single-file wrote /tmp/test-do-convert-single-file912093989/0-thumbnail.jpg, but we expected it not to exist
...

Upon seeing this error, you can docker cp f65521f3a30c:/tmp/test-do-convert-single-file912093989/0-thumbnail.jpg . to inspect the file in question (and perhaps make it the expected one).

Testing PDF conversion

PDF output is a common case. We use QPDF for file comparison, to ease debugging. Your Dockerfile must install QPDF -- e.g., apk --no-cache add qpdf -- before running RUN [ "/app/test-convert-single-file" ] if you are testing PDF output.

/app/convert-stream-to-mime-multipart

This version of /app/convert will:

  1. Create an empty temporary directory
  2. Run /app/do-convert-stream-to-mime-multipart MIME-BOUNDARY JSON (your code) within the temporary directory
  3. Stream the input file from Overview to your program's stdin and pipe your program's stdout to Overview

Special cases:

  • Cancelation: if /app/run sends a SIGINT signal, /app/convert forwards SIGINT to your program. Your program should kill and wait for any child processes, then exit. Its standard output and standard error will be ignored.
  • Error: if your program exits with a non-zero status, /app/convert pipes an error event.
  • Buggy code: /app/convert emits an error event if your program does not produce an error or done event or does not end its output with --MIME-BOUNDARY--.
  • Temporary files: if your program emits temporary files to its current working directory, they will be deleted.

You must provide /app/do-convert-stream-to-mime-multipart. The framework will invoke it with MIME-BOUNDARY and JSON as arguments. MIME-BOUNDARY will match the regex [a-fA-F0-9]{1,60}. Your program reads the input file from standard input; the temporary directory starts out empty.
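
A converter written in Python could pick up its two arguments along these lines. This is a minimal sketch: `parse_args` and `BOUNDARY_RE` are hypothetical names, and validating the boundary is a defensive extra rather than something the framework requires.

```python
import json
import re

BOUNDARY_RE = re.compile(r"[a-fA-F0-9]{1,60}")  # regex quoted in this README


def parse_args(argv):
    """Return (boundary, task) from [program, MIME-BOUNDARY, JSON]."""
    if len(argv) != 3:
        raise ValueError("expected exactly two arguments: MIME-BOUNDARY JSON")
    boundary, raw_json = argv[1], argv[2]
    if not BOUNDARY_RE.fullmatch(boundary):
        raise ValueError(f"unexpected MIME boundary: {boundary!r}")
    return boundary, json.loads(raw_json)
```

A script would typically call `boundary, task = parse_args(sys.argv)` first and reuse `boundary` for every part it writes.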

Your program must write valid multipart/form-data output to stdout. For instance:

--MIME-BOUNDARY\r\n
Content-Disposition: form-data; name="0.json"\r\n
\r\n
{JSON for first output file}\r\n
--MIME-BOUNDARY\r\n
Content-Disposition: form-data; name="0.blob"\r\n
\r\n
Blob for first output file\r\n
--MIME-BOUNDARY\r\n
Content-Disposition: form-data; name="progress"\r\n
\r\n
{"pages":{"nProcessed":1,"nTotal":3}}\r\n
--MIME-BOUNDARY\r\n
Content-Disposition: form-data; name="done"\r\n
\r\n
--MIME-BOUNDARY--

Rules:

  • Your output must end with a done or error element. A done element should be empty; an error element must include an error message.
  • Your output must be in order: 0.json, 0.blob, (optionally 0.png, 0.jpg and/or 0.txt), 1.json, 1.blob, ..., done.
  • You should output an accurate progress report before each N.json to help Overview's progressbar behave well.
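
The output format and ordering rules above can be produced with two small helpers. The Python sketch below mirrors the byte layout of the example output; `write_part`, `write_close` and `example_output` are hypothetical names, and the 0.json body is a placeholder because its schema is not described here.

```python
import io
import json


def write_part(out, boundary, name, body=b""):
    """Write one form-data part in the layout of the example above."""
    out.write(f"--{boundary}\r\n".encode("ascii"))
    header = f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
    out.write(header.encode("ascii"))
    if body:
        out.write(body + b"\r\n")


def write_close(out, boundary):
    """Write the closing delimiter that ends the whole stream."""
    out.write(f"--{boundary}--".encode("ascii"))


def example_output(boundary):
    """Build a one-document stream: progress, 0.json, 0.blob, done."""
    out = io.BytesIO()
    progress = {"pages": {"nProcessed": 0, "nTotal": 1}}
    write_part(out, boundary, "progress", json.dumps(progress).encode("utf-8"))
    # Placeholder metadata: the real 0.json schema is defined by Overview.
    write_part(out, boundary, "0.json", json.dumps({"placeholder": True}).encode("utf-8"))
    write_part(out, boundary, "0.blob", b"Blob for first output file")
    write_part(out, boundary, "done")
    write_close(out, boundary)
    return out.getvalue()
```

In a real converter, `out` would be `sys.stdout.buffer`, so that the blob bytes are written without any text encoding.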

Roll your own

Even more lightweight than /app/convert-stream-to-mime-multipart is to roll your own version of /app/convert. Beware, though:

  • Your own version of /app/convert must always output messages to Overview: especially a done or error event. Without those events, Overview will never finish processing the file: it will retry indefinitely.
  • Your own version of /app/convert must always exit successfully. The trickiest case, in our experience, is handling "out of memory." If your /app/convert does not exit successfully, Overview will retry indefinitely and the file will never be processed.
  • Your own version of /app/convert should output helpful error messages, so you can debug it easily.
  • Your own version of /app/convert should end quickly after receiving SIGINT, because Overview will ignore all further output.
  • Your own version of /app/convert must ensure temporary files created during one invocation aren't read by the next invocation: that would leak users' documents to other users.

/app/convert-stream-to-mime-multipart is small and fast, and it solves these problems for you. You probably want it.
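
If you do roll your own, the guarantees listed above might be approached as in the Python sketch below. `safe_convert` and `do_convert` are hypothetical names, and the sketch only covers failures Python can observe: a process killed outright by the operating system (for instance by the out-of-memory killer) cannot emit anything, which is one reason the provided implementations are the safer choice.

```python
import shutil
import tempfile


def _part(boundary, name, body=b""):
    """Encode one form-data part using the layout shown earlier."""
    head = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
    return head.encode("ascii") + (body + b"\r\n" if body else b"")


def safe_convert(boundary, do_convert, stdin, stdout):
    """Run do_convert(stdin, stdout, workdir) and always finish the stream.

    `do_convert` is a hypothetical callable standing in for your converter.
    It writes the N.json/N.blob parts; this wrapper adds `done` or `error`.
    """
    workdir = tempfile.mkdtemp(prefix="convert-")
    try:
        do_convert(stdin, stdout, workdir)
        stdout.write(_part(boundary, "done"))
    except BaseException as exc:  # also KeyboardInterrupt (SIGINT), SystemExit
        message = f"{type(exc).__name__}: {exc}".encode("utf-8", "replace")
        stdout.write(_part(boundary, "error", message))
    finally:
        # Never let one invocation's files be visible to the next one.
        shutil.rmtree(workdir, ignore_errors=True)
    stdout.write(f"--{boundary}--".encode("ascii"))
    return 0  # always report success to /app/run
```

Using a fresh temporary directory per invocation and deleting it in `finally` addresses the document-leak concern; catching `BaseException` keeps the exit status at 0 even when the converter raises.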

To Maintain This Repository

Coding

./dev will start a development loop that runs tests. Restart it if you edit Dockerfile.

Testing

docker build . will run all tests.

Tests are in ./test/*/suite.bats. They're run in bats, an ideal framework for testing programs that pipe data around.

Releasing

./release MAJOR.MINOR.PATCH will push to GitHub. Docker Hub will build the images for mass consumption.

License

This software is Copyright 2011-2018 Jonathan Stray and Copyright 2019-2020 Overview Computing Inc., and distributed under the terms of the GNU Affero General Public License. See the LICENSE file for details.
