Converts text files for Overview.
This program always outputs 0.json, 0.txt, 0.blob and 0-thumbnail.jpg.
The output JSON has "wantOcr": false and "contentType": "application/pdf". Other input JSON (in particular, "wantSplitByPage") is passed through.
We guess character set using chardet. If the user wants to guarantee a certain character set, the user must encode the text as UTF-8: it's the only character set we detect 100% accurately.
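As a rough illustration of the trade-off above, the sketch below decodes a blob by trying strict UTF-8 first and only then falling back to a guess. The function name decode_text, the injectable detect parameter and the UTF-8-first ordering are assumptions made for this example, not a description of the converter's actual code. chardet.detect() is the real chardet entry point, and it is only imported if the fallback is needed.

```python
from typing import Callable, Optional


def decode_text(blob: bytes, detect: Optional[Callable[[bytes], dict]] = None) -> str:
    """Decode bytes to str, preferring UTF-8 and guessing otherwise (hypothetical helper)."""
    try:
        # Strict UTF-8 decoding either succeeds or raises: no guessing involved.
        return blob.decode("utf-8")
    except UnicodeDecodeError:
        pass

    if detect is None:
        # Assumption: chardet is installed. Imported lazily so the UTF-8 path
        # has no third-party dependency.
        import chardet
        detect = chardet.detect

    guess = detect(blob)  # a dict with "encoding" and "confidence" keys
    encoding = guess.get("encoding") or "utf-8"
    # The guess can be wrong, so replace undecodable bytes instead of failing.
    return blob.decode(encoding, errors="replace")
```

With this shape, text that is valid UTF-8 never reaches the detector, which is why encoding input as UTF-8 is the only way to guarantee the result.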
We generate the PDF using ReportLab. It's the fastest, despite its overpowered API.
We syntax-highlight using Pygments because ... well ... what a fun feature, right? We trust the contentType Overview sends us.
We generate a thumbnail using pdftocairo. We output JPG: it saves ~0.03s.
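A minimal sketch of how such a pdftocairo call might be built from Python follows. The helper names thumbnail_command and make_thumbnail, and the 700-pixel size, are hypothetical; the converter's real invocation may use different options. The flags themselves (-jpeg, -singlefile, -f, -l, -scale-to) are standard pdftocairo options.

```python
import subprocess
from typing import List


def thumbnail_command(pdf_path: str, output_prefix: str, max_px: int = 700) -> List[str]:
    """Build a pdftocairo argv that renders page 1 of pdf_path as a JPEG."""
    return [
        "pdftocairo",
        "-jpeg",        # JPEG output rather than PNG
        "-singlefile",  # write exactly <output_prefix>.jpg, with no page-number suffix
        "-f", "1",      # first page to render
        "-l", "1",      # last page to render
        "-scale-to", str(max_px),  # longest side in pixels (hypothetical size)
        pdf_path,
        output_prefix,
    ]


def make_thumbnail(pdf_path: str, output_prefix: str) -> None:
    # Requires pdftocairo (poppler-utils) on PATH; raises CalledProcessError on failure.
    subprocess.run(thumbnail_command(pdf_path, output_prefix), check=True)
```

Under these assumptions, make_thumbnail("0.blob", "0-thumbnail") would write 0-thumbnail.jpg next to the PDF.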
Right now, we only embed Noto Sans Mono. The annoying reality is that TTF fonts support a maximum of 64k characters. TODO solve this problem, so we can display text in all supported languages.
Write each test in its own directory, test/test-XYZ. Running docker build . runs the tests.
Each test has input.blob (which means the same as in production) and input.json (whose contents are $1 in do-convert-single-file). The files stdout, 0.json, 0.blob, 0.txt, and 0-thumbnail.(png|jpg) in the test directory are expected values. If actual values differ from expected values, the test fails.
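The comparison described above can be sketched in a few lines. This is an illustration only, not the actual test runner: compare_outputs is a hypothetical helper, and it assumes the program's captured standard output has been saved as a file named stdout in the actual-output directory.

```python
from pathlib import Path
from typing import List


def compare_outputs(expected_dir: Path, actual_dir: Path) -> List[str]:
    """Return human-readable mismatches; an empty list means the test passes."""
    inputs = {"input.json", "input.blob"}  # test inputs, not expected outputs
    expected = {p.name for p in expected_dir.iterdir()
                if p.is_file() and p.name not in inputs}
    actual = {p.name for p in actual_dir.iterdir() if p.is_file()}

    problems = []
    for name in sorted(expected - actual):
        problems.append(f"expected {name} but it was not written")
    for name in sorted(actual - expected):
        problems.append(f"wrote {name}, but we expected it not to exist")
    for name in sorted(expected & actual):
        # Byte-for-byte comparison: any difference fails the test.
        if (expected_dir / name).read_bytes() != (actual_dir / name).read_bytes():
            problems.append(f"{name} differs from the expected value")
    return problems
```

Because the comparison is byte-for-byte, an expected file that is missing from the test directory counts as "expected not to exist", so an unexpected output fails the test just as a differing one does.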
PDF, PNG and JPEG are tricky formats to get exactly right. You may need to use the Docker image itself to generate expected output files. For instance, this is how we built test/test-1page/0-thumbnail.jpg:
- Wrote test/test-1page/{input.json,input.blob,0.txt,0.blob,stdout}
- Ran docker build . The end of the output looked like this:

      Step 12/13 : RUN [ "/app/test-convert-single-file" ]
       ---> Running in f65521f3a30c
      1..3
      not ok 1 - test-1page do-convert-single-file wrote /tmp/test-do-convert-single-file912093989/0-thumbnail.jpg, but we expected it not to exist
      ...

- Copied the generated thumbnail out of the stopped container: docker cp f65521f3a30c:/tmp/test-do-convert-single-file912093989/0-thumbnail.jpg test/test-1page/
- Removed the container: docker rm -f f65521f3a30c