% Template for PLoS
% Version 3.5 March 2018
%
% % % % % % % % % % % % % % % % % % % % % %
%
% -- IMPORTANT NOTE
%
% This template contains comments intended
% to minimize problems and delays during our production
% process. Please follow the template instructions
% whenever possible.
%
% % % % % % % % % % % % % % % % % % % % % % %
%
% Once your paper is accepted for publication,
% PLEASE REMOVE ALL TRACKED CHANGES in this file
% and leave only the final text of your manuscript.
% PLOS recommends the use of latexdiff to track changes during review, as this will help to maintain a clean tex file.
% Visit https://www.ctan.org/pkg/latexdiff?lang=en for info or contact us at [email protected].
%
%
% There are no restrictions on package use within the LaTeX files except that
% no packages listed in the template may be deleted.
%
% Please do not include colors or graphics in the text.
%
% The manuscript LaTeX source should be contained within a single file (do not use \input, \externaldocument, or similar commands).
%
% % % % % % % % % % % % % % % % % % % % % % %
%
% -- FIGURES AND TABLES
%
% Please include tables/figure captions directly after the paragraph where they are first cited in the text.
%
% DO NOT INCLUDE GRAPHICS IN YOUR MANUSCRIPT
% - Figures should be uploaded separately from your manuscript file.
% - Figures generated using LaTeX should be extracted and removed from the PDF before submission.
% - Figures containing multiple panels/subfigures must be combined into one image file before submission.
% For figure citations, please use "Fig" instead of "Figure".
% See http://journals.plos.org/plosone/s/figures for PLOS figure guidelines.
%
% Tables should be cell-based and may not contain:
% - spacing/line breaks within cells to alter layout or alignment
% - do not nest tabular environments (no tabular environments within tabular environments)
% - no graphics or colored text (cell background color/shading OK)
% See http://journals.plos.org/plosone/s/tables for table guidelines.
%
% For tables that exceed the width of the text column, use the adjustwidth environment as illustrated in the example table in text below.
%
% % % % % % % % % % % % % % % % % % % % % % % %
%
% -- EQUATIONS, MATH SYMBOLS, SUBSCRIPTS, AND SUPERSCRIPTS
%
% IMPORTANT
% Below are a few tips to help format your equations and other special characters according to our specifications. For more tips to help reduce the possibility of formatting errors during conversion, please see our LaTeX guidelines at http://journals.plos.org/plosone/s/latex
%
% For inline equations, please be sure to include all portions of an equation in the math environment.
%
% Do not include text that is not math in the math environment.
%
% Please add line breaks to long display equations when possible in order to fit size of the column.
%
% For inline equations, please do not include punctuation (commas, etc) within the math environment unless this is part of the equation.
%
% When adding superscript or subscripts outside of brackets/braces, please group using {}.
%
% Do not use \cal for caligraphic font. Instead, use \mathcal{}
%
% % % % % % % % % % % % % % % % % % % % % % % %
%
% Please contact [email protected] with any questions.
%
% % % % % % % % % % % % % % % % % % % % % % % %
\documentclass[10pt,letterpaper]{article}
\usepackage[top=0.85in,left=2.75in,footskip=0.75in]{geometry}
% amsmath and amssymb packages, useful for mathematical formulas and symbols
\usepackage{amsmath,amssymb}
% Use adjustwidth environment to exceed column width (see example table in text)
\usepackage{changepage}
% Use Unicode characters when possible
\usepackage[utf8x]{inputenc}
% textcomp package and marvosym package for additional characters
\usepackage{textcomp,marvosym}
% cite package, to clean up citations in the main text. Do not remove.
% \usepackage{cite}
% Use nameref to cite supporting information files (see Supporting Information section for more info)
\usepackage{nameref,hyperref}
% line numbers
\usepackage[right]{lineno}
% ligatures disabled
\usepackage{microtype}
\DisableLigatures[f]{encoding = *, family = * }
% color can be used to apply background shading to table cells only
\usepackage[table]{xcolor}
% array package and thick rules for tables
\usepackage{array}
% create "+" rule type for thick vertical lines
\newcolumntype{+}{!{\vrule width 2pt}}
% create \thickcline for thick horizontal lines of variable length
\newlength\savedwidth
\newcommand\thickcline[1]{%
\noalign{\global\savedwidth\arrayrulewidth\global\arrayrulewidth 2pt}%
\cline{#1}%
\noalign{\vskip\arrayrulewidth}%
\noalign{\global\arrayrulewidth\savedwidth}%
}
% \thickhline command for thick horizontal lines that span the table
\newcommand\thickhline{\noalign{\global\savedwidth\arrayrulewidth\global\arrayrulewidth 2pt}%
\hline
\noalign{\global\arrayrulewidth\savedwidth}}
% Remove comment for double spacing
%\usepackage{setspace}
%\doublespacing
% Text layout
\raggedright
\setlength{\parindent}{0.5cm}
\textwidth 5.25in
\textheight 8.75in
% Bold the 'Figure #' in the caption and separate it from the title/caption with a period
% Captions will be left justified
\usepackage[aboveskip=1pt,labelfont=bf,labelsep=period,justification=raggedright,singlelinecheck=off]{caption}
\renewcommand{\figurename}{Fig}
% Use the PLoS provided BiBTeX style
% \bibliographystyle{plos2015}
% Remove brackets from numbering in List of References
\makeatletter
\renewcommand{\@biblabel}[1]{\quad#1.}
\makeatother
% Header and Footer with logo
\usepackage{lastpage,fancyhdr,graphicx}
\usepackage{epstopdf}
%\pagestyle{myheadings}
\pagestyle{fancy}
\fancyhf{}
%\setlength{\headheight}{27.023pt}
%\lhead{\includegraphics[width=2.0in]{PLOS-submission.eps}}
\rfoot{\thepage/\pageref{LastPage}}
\renewcommand{\headrulewidth}{0pt}
\renewcommand{\footrule}{\hrule height 2pt \vspace{2mm}}
\fancyheadoffset[L]{2.25in}
\fancyfootoffset[L]{2.25in}
\lfoot{\today}
%% Include all macros below
\newcommand{\lorem}{{\bf LOREM}}
\newcommand{\ipsum}{{\bf IPSUM}}
\usepackage{color}
\usepackage{fancyvrb}
\newcommand{\VerbBar}{|}
\newcommand{\VERB}{\Verb[commandchars=\\\{\}]}
\DefineVerbatimEnvironment{Highlighting}{Verbatim}{commandchars=\\\{\}}
% Add ',fontsize=\small' for more characters per line
\usepackage{framed}
\definecolor{shadecolor}{RGB}{248,248,248}
\newenvironment{Shaded}{\begin{snugshade}}{\end{snugshade}}
\newcommand{\AlertTok}[1]{\textcolor[rgb]{0.94,0.16,0.16}{#1}}
\newcommand{\AnnotationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\AttributeTok}[1]{\textcolor[rgb]{0.77,0.63,0.00}{#1}}
\newcommand{\BaseNTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\BuiltInTok}[1]{#1}
\newcommand{\CharTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\CommentTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textit{#1}}}
\newcommand{\CommentVarTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\ConstantTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\ControlFlowTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{#1}}}
\newcommand{\DataTypeTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{#1}}
\newcommand{\DecValTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\DocumentationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\ErrorTok}[1]{\textcolor[rgb]{0.64,0.00,0.00}{\textbf{#1}}}
\newcommand{\ExtensionTok}[1]{#1}
\newcommand{\FloatTok}[1]{\textcolor[rgb]{0.00,0.00,0.81}{#1}}
\newcommand{\FunctionTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\ImportTok}[1]{#1}
\newcommand{\InformationTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\newcommand{\KeywordTok}[1]{\textcolor[rgb]{0.13,0.29,0.53}{\textbf{#1}}}
\newcommand{\NormalTok}[1]{#1}
\newcommand{\OperatorTok}[1]{\textcolor[rgb]{0.81,0.36,0.00}{\textbf{#1}}}
\newcommand{\OtherTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{#1}}
\newcommand{\PreprocessorTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textit{#1}}}
\newcommand{\RegionMarkerTok}[1]{#1}
\newcommand{\SpecialCharTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\SpecialStringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\StringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\VariableTok}[1]{\textcolor[rgb]{0.00,0.00,0.00}{#1}}
\newcommand{\VerbatimStringTok}[1]{\textcolor[rgb]{0.31,0.60,0.02}{#1}}
\newcommand{\WarningTok}[1]{\textcolor[rgb]{0.56,0.35,0.01}{\textbf{\textit{#1}}}}
\usepackage[T1]{fontenc}
\usepackage{lmodern}
\usepackage[user,titleref]{zref}
\usepackage{nameref}
\newcommand*{\rulelabel}[2]{\ztitlerefsetup{title=#1} \zlabel{#2} \label{#2} \zrefused{#2}}
\newcommand*{\ruleref}[1]{\hyperref[{#1}]{Rule~\ztitleref{#1}}}
\usepackage{listings}
\lstdefinelanguage{docker}{keywords={FROM, RUN, COPY, ADD, ENTRYPOINT, CMD, ENV, ARG, WORKDIR, EXPOSE, LABEL, USER, VOLUME, STOPSIGNAL, ONBUILD, MAINTAINER}, keywordstyle=\color{blue}\ttfamily, identifierstyle=\color{black}\ttfamily, sensitive=false, comment=[l]{\#}, commentstyle=\color{darkgray}\ttfamily, stringstyle=\color{red}\ttfamily, morestring=[b]', morestring=[b]"}
\lstset{language=docker, literate={ü}{{\"u}}1 {-}{-}1, showstringspaces=false}
\usepackage{forarray}
\usepackage{xstring}
\newcommand{\getIndex}[2]{
\ForEach{,}{\IfEq{#1}{\thislevelitem}{\number\thislevelcount\ExitForEach}{}}{#2}
}
\setcounter{secnumdepth}{0}
\newcommand{\getAff}[1]{
\getIndex{#1}{ifgi,stancomp,uowanthro,mathcam,wtt,tou,uobpsych}
}
\providecommand{\tightlist}{%
\setlength{\itemsep}{0pt}\setlength{\parskip}{0pt}}
\begin{document}
\vspace*{0.2in}
% Title must be 250 characters or less.
\begin{flushleft}
{\Large
\textbf\newline{Ten Simple Rules for Writing Dockerfiles for Reproducible Data Science} % Please use "sentence case" for title and headings (capitalize only the first word in a title (or heading), the first word in a subtitle (or subheading), and any proper nouns).
}
\newline
% Insert author names, affiliations and corresponding author email (do not include titles, positions, or degrees).
\\
Daniel Nüst\textsuperscript{\getAff{ifgi}}\textsuperscript{*},
Vanessa Sochat\textsuperscript{\getAff{stancomp}},
Ben Marwick\textsuperscript{\getAff{uowanthro}},
Stephen J. Eglen\textsuperscript{\getAff{mathcam}},
Tim Head\textsuperscript{\getAff{wtt}},
Tony Hirst\textsuperscript{\getAff{tou}},
Benjamin D. Evans\textsuperscript{\getAff{uobpsych}}\\
\bigskip
\textbf{\getAff{ifgi}}Institute for Geoinformatics, University of Münster, Münster, Germany\\
\textbf{\getAff{stancomp}}Stanford Research Computing Center, Stanford University, Stanford,
California, USA\\
\textbf{\getAff{uowanthro}}Department of Anthropology, University of Washington, Seattle,
Washington, USA\\
\textbf{\getAff{mathcam}}Department of Applied Mathematics and Theoretical Physics, University of
Cambridge, Cambridge, Cambridgeshire, Great Britain\\
\textbf{\getAff{wtt}}Wild Tree Tech, Zurich, Switzerland\\
\textbf{\getAff{tou}}Department of Computing and Communications, The Open University, Great
Britain\\
\textbf{\getAff{uobpsych}}School of Psychological Science, University of Bristol, Bristol, Great
Britain\\
\bigskip
* Corresponding author: [email protected]\\
\end{flushleft}
% Please keep the abstract below 300 words
\section*{Abstract}
Computational science has been greatly improved by the use of containers
for packaging software and data dependencies. In a scholarly context,
the main drivers for using these containers are transparency and support
of reproducibility; in turn, a workflow's reproducibility can be greatly
affected by the choices that are made with respect to building
containers. In many cases, the build process for the container's image
is created from instructions provided in a \texttt{Dockerfile} format.
In support of this approach, we present a set of rules to help
researchers write understandable \texttt{Dockerfile}s for typical data
science workflows. By following the rules in this article, researchers
can create containers suitable for sharing with fellow scientists, for
including in scholarly communication such as education or scientific
papers, and for effective and sustainable personal workflows.
% Please keep the Author Summary between 150 and 200 words
% Use first person. PLOS ONE authors please skip this step.
% Author Summary not valid for PLOS ONE submissions.
\section*{Author summary}
Computers and algorithms are ubiquitous in research. Therefore, defining
the computing environment, i.e., the body of all software used directly
or indirectly by a researcher, is important, because it allows other
researchers to recreate the environment to understand, inspect, and
reproduce an analysis. A helpful abstraction for capturing the computing
environment is a container, whereby a container is created from a set of
instructions in a recipe. For the most common containerisation software,
Docker, this recipe is called a Dockerfile. We believe that in a
scientific context, researchers should follow specific practices for
writing a Dockerfile. These practices might be somewhat different from
the practices of generic software developers in that researchers often
need to focus on transparency and understandability rather than
performance considerations. The rules presented here are intended to
help researchers, especially newcomers to containerisation, leverage
containers for open and effective scholarly communication and
collaboration while avoiding the pitfalls that are especially irksome in
a research lifecycle. The recommendations cover a deliberate approach to
Dockerfile creation, formatting and style, documentation, and habits for
using containers.
\linenumbers
% Use "Eq" instead of "Equation" for equation citations.
\hypertarget{introduction}{%
\section*{Introduction}\label{introduction}}
\addcontentsline{toc}{section}{Introduction}
Computing infrastructure has advanced to the point where not only can we
share data underlying research articles, but we can also share the code
that processes these data. The sharing of code files is enabled by
collaboration platforms such as \href{https://github.com}{GitHub} or
\href{https://gitlab.com}{GitLab} and is becoming an increasingly common
practice. The sharing of the computing environment is enabled by
containerisation, which allows for documenting and sharing entire
workflows in a comprehensive way. Importantly, this sharing of
computational assets is paramount for increasing the reproducibility of
computational research. While papers based on the traditional journal
article format can share extensive details about the research,
computational research is often far too complicated to be effectively
disseminated in this format {[}1{]}. Approaches such as containerisation
are needed to support computational research, including the analysis and
visualisation of data, because a paper's actual contribution to knowledge
includes the full computing environment that produced a result {[}2{]}.
Containerisation helps provide instructions for packaging the building
blocks of computer-based research (i.e., code, data, documentation, and
the computing environment). Specifically, containers are built from
plain text files that represent a human- \textbf{and} machine-readable
recipe for creating the computing environment and interacting with data.
By providing this recipe, authors of scientific articles greatly improve
their work's level of documentation, transparency, and reusability. This
is an important part of common practice for scientific computing
{[}3,4{]}. An overall goal of these practices is to ensure that both the
author and others are able to reproduce and extend an analysis workflow.
The containers built from these recipes are portable encapsulated
snapshots of a specific computing environment that are both more
lightweight and transparent than virtual machines. Such containers have
been demonstrated for capturing scientific notebooks {[}5{]} and
reproducible workflows {[}6{]}.
While several tutorials exist on how to use containers for reproducible
research ({[}7--11{]} and Gruening and colleagues {[}12{]} give very
helpful recommendations for packaging reusable software in a container),
there is no detailed manual for how to write the actual instructions to
create the containers for computational research besides generic best
practice guides {[}13,14{]}. Here we introduce a set of recommendations
for producing container configurations in the context of data science
workflows using the popular \texttt{Dockerfile} format, summarised in
Fig~\ref{fig:summary}.
\begin{figure}[h]
{\centering \includegraphics[width=0.5\linewidth]{figures/summary}
}
\caption{Summary of the 10 simple rules for writing \texttt{Dockerfile}s for reproducible data science.}\label{fig:summary}
\end{figure}
\hypertarget{prerequisites-scope}{%
\section{Prerequisites \& scope}\label{prerequisites-scope}}
To start with, we assume the existence of a scripted scientific
workflow, i.e.~you can, at least at a certain point in time, execute the
full process with a fixed set of commands, for example
\texttt{make\ prepare\_data} followed by \texttt{Rscript\ analysis.R},
or only \texttt{python3\ my-workflow.py}. To maximise reach, we assume
that containers, which you eventually share with others, can only run
open source software; tools like Mathematica and Matlab are out of scope
for this example. A workflow that does not support scripted execution is
also out of scope for reproducible research, as it does not fit well
with containerisation. Furthermore, workflows interacting with many
petabytes of data and executed in high-performance computing (HPC)
infrastructures are out of scope. Using such HPC job managers or cloud
infrastructures would require a collection of ``Ten Simple Rules''
articles in their own right. For the HPC use case, we encourage the
reader to look at Singularity {[}15{]}. For this article, we focus on
workflows that typically run on a single machine, e.g., a researcher's own
laptop or a virtual server. As a rough guide, assume a data
requirement of under a terabyte, and a compute requirement of a machine
with 16 cores running over a weekend.
Although it is outside the scope of this article, we point readers to
\texttt{docker-compose} {[}16{]} in the case where one might need
container orchestration for multiple applications, e.g., web servers,
databases, and worker containers. A \texttt{docker-compose.yml}
configuration file allows for defining mounts, environment variables,
and exposed ports and helps users stick to ``one purpose per
container'', which often means one process running in the container, and
to combine existing stable building blocks instead of bespoke massive
containers for specific purposes.
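As an illustration only, such a configuration might look as follows; the
service names, images, and mounts are hypothetical placeholders, not part of
the workflow used in this article:

\scriptsize
\begin{lstlisting}[language={},breaklines=true]
# docker-compose.yml: one purpose per container
version: "3"
services:
  database:
    image: postgres:12
    environment:
      - POSTGRES_PASSWORD=example
  worker:
    build: .            # built from the Dockerfile in this directory
    volumes:
      - ./input:/input  # mount local data into the container
    depends_on:
      - database
\end{lstlisting}
\normalsize

Running \texttt{docker-compose\ up} then starts all defined containers
together.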
Because \emph{``the number of unique research environments approximates
the number of researchers''} {[}17{]}, sticking to conventions helps
every researcher to understand, modify, and eventually write container
recipes suitable for their needs. Even if they are not sure how the
underlying technology actually works, researchers should leverage
containerisation following good practices. The practices that are to be
discussed in this article are strongly related to software engineering
in general and research software engineering in particular, which is
concerned with quality, training, and recognition of software in science
{[}18{]}. We encourage you to reach out to your local or national
community of research software engineers (see
\href{https://en.wikipedia.org/wiki/Research_software_engineering}{list
of organisations}) if you have questions on software development in
research that go beyond the rules of this work.
While many different container technologies exist, this article focuses
on Docker {[}19{]}. Docker is a highly suitable tool for reproducible
research (e.g., {[}20{]}), and our observations indicate it is the most
widely used container technology in academic data science. The goal of
this article is to guide you as you write a \texttt{Dockerfile}, the
file format used to create Docker container images. The rules will help
you ensure that the \texttt{Dockerfile} allows for interactive
development as well as for reaching the higher goals of reproducibility
and preservation of knowledge. Such practices do not generally appear in
generic containerisation tutorials and they are rarely found in the
\texttt{Dockerfile}s published as part of software projects that are
often used as templates by novices. The differences between a helpful,
stable \texttt{Dockerfile} and one that is misleading, prone to failure,
and full of potential obstacles are not obvious, especially for
researchers who do not have extensive software development experience or
formal training. By committing to this article's rules, one can ensure
that their workflows are reproducible and reusable, that computing
environments are understandable by others, and that researchers have the
opportunity to collaborate effectively. Applying these rules should not
be triggered by the publication of a finished project but should instead
be weaved into day-to-day habits (cf.~thoughts on openness as an
afterthought by {[}21{]} and on computational reproducibility by
{[}2{]}).
\hypertarget{docker-and-dockerfiles}{%
\section{Docker and Dockerfiles}\label{docker-and-dockerfiles}}
Docker {[}19{]} is a container technology that has been widely adopted
and is supported on many platforms, and it has become highly useful for
research. Containers are distinct from virtual machines or hypervisors,
as they do not emulate hardware or operating system kernels and hence do
not require the same system resources. Several solutions for
facilitating reproducible research are built on top of containers
{[}17,22--25{]}, but these solutions intentionally hide most of the
complexity from the researcher.
To create Docker containers for specific workflows, we write text files
that follow a particular format called \texttt{Dockerfile} {[}26{]}. A
\texttt{Dockerfile} is a machine- \textbf{and} human-readable recipe,
comparable to a \texttt{Makefile} {[}27{]}, for building
\textbf{images}. Here, images are executable files that include the
application, e.g., the programming language interpreter needed to run a
workflow, and the system libraries required by an application to run.
Thus, a \texttt{Dockerfile} consists of a sequence of instructions to
copy files and install software. Each instruction adds a layer to the
image, which can be cached across image builds for minimising build and
download times. Once an image is built or downloaded, it is then
launched as a running instance known as a \textbf{container}. The images
have a main executable exposed as an ``entrypoint'' that is started when
they are run as stateful containers. Further, containers can be
modified, stopped, restarted, and purged.
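To make these instruction types concrete, a minimal \texttt{Dockerfile} might
look as follows; the base image and script name are illustrative placeholders
(a full example follows later in this article):

\scriptsize
\begin{lstlisting}[language=docker,breaklines=true]
# Each instruction below adds one layer to the image
FROM ubuntu:18.04
# Install software into the image
RUN apt-get update && apt-get install -y python3
# Copy a (hypothetical) workflow script into the image
COPY analysis.py /work/analysis.py
# Main executable started when the container is run
ENTRYPOINT ["python3", "/work/analysis.py"]
\end{lstlisting}
\normalsize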
A visual analogy for building and running a container is provided in
Fig~\ref{fig:analogy}. Akin to compiling source code for a programming
language, creating a container also starts with a plain text file
(\texttt{Dockerfile}), which provides instructions for building an
image. Similar to using a compiled binary file to launch a program, the
image is then run to create a container instance. See
Listing~\ref{lst:full} for a full \texttt{Dockerfile}, which we will
refer to throughout this article.
\begin{figure}[h]
{\centering \includegraphics[width=1\linewidth]{figures/analogy}
}
\caption{The workflow to create Docker containers by analogy. Containers begin with a \texttt{Dockerfile}, a recipe for building the computational environment (analogous to source code in a compiled programming language). This is used to build an image with the \texttt{docker build} command, analogous to compiling the source code into an executable (binary) file. Finally, the image is used to launch one or more containers with the \texttt{docker run} command (analogous to running an instance of the compiled binary as a process).}\label{fig:analogy}
\end{figure}

While Docker was the original technology to support the
\texttt{Dockerfile} format, other container technologies now offer
support for it, including
\href{https://podman.io/}{podman}/\href{https://github.com/containers/buildah}{buildah}
supported by RedHat,
\href{https://github.com/GoogleContainerTools/kaniko}{kaniko},
\href{https://github.com/genuinetools/img}{img}, and
\href{https://github.com/moby/buildkit}{buildkit}. The container
software Singularity {[}15{]}, which is optimised for scientific
computing and the security needs of HPC environments, uses its own
format, called the \emph{Singularity recipe}, but it can also import and
run Docker images. The rules here are, to some extent, transferable to
Singularity recipes.
While some may argue against publishing reproducibly, e.g., due to a
lack of time and incentives, a reluctance to share (cf.~{[}28{]}), and
the substantial technical challenges involved in maintaining software
and documentation, it should become increasingly straightforward for the
average researcher to provide computational environment support for
their publication in the form of a \texttt{Dockerfile}, a pre-built
Docker image, or another type of container. If a researcher can find and
create containers or write a \texttt{Dockerfile} to address their most
common use cases, then, arguably, sharing it would not make for extra
work after this initial setup (cf.~\texttt{README.md} of {[}29{]}). In
fact, the \texttt{Dockerfile} itself represents powerful documentation
to show from where data and code were derived, i.e., downloaded or
installed, and, consequently, where a third party might obtain the data
again.
\scriptsize
\begin{lstlisting}[language=docker,caption={\texttt{Dockerfile} full example. The \texttt{Dockerfile} and all other files are published in the \texttt{full-demo} example, see Section~\nameref{examples}; the image \texttt{docker.io/nuest/datascidockerfiles:1.0.0} is a ready-to-use build of this example.},breaklines=true,label={lst:full}]
FROM docker.io/rocker/verse:3.6.2
### INSTALL BASE SOFTWARE #####################################################
# Install Java, needed for package rJava
RUN apt-get update && \
apt-get install -y default-jdk && \
rm -rf /var/lib/apt/lists/*
### INSTALL WORKFLOW TOOLS ####################################################
# Install system dependencies for R packages
RUN apt-get update && \
apt-get install -y \
# needed for RNetCDF, found via https://sysreqs.r-hub.io/pkg/RNetCDF
libnetcdf-dev libudunits2-dev \
# needed for git2r:
libgit2-dev
# Install R packages, based on https://github.com/rocker-org/geospatial/blob/master/Dockerfile
RUN install2.r --error \
RColorBrewer \
RNetCDF \
git2r \
rJava
WORKDIR /tmp
# Install Python tools and their system dependencies
RUN apt-get update && \
apt-get install -y python-pip && \
rm -rf /var/lib/apt/lists/*
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
# Download superduper image converter
RUN wget https://downloads.apache.org/pdfbox/2.0.19/pdfbox-app-2.0.19.jar
### ADD MY OWN SCRIPTS ########################################################
# Add workflow scripts
WORKDIR /work
COPY myscript.sh myscript.sh
COPY analysis.py analysis.py
COPY plots.R plots.R
# Configure workflow
ENV DATA_SIZE 42
# Uncomment the following lines to execute preprocessing tasks during build
#RUN python analysis.py
#RUN Rscript plots.R
### WORKFLOW CONTAINER FEATURE ################################################
# CMD from base image used for development, uncomment the following lines to
# have a "run workflow only" image
# CMD ["./myscript.sh"]
### Usage instructions ########################################################
# Build the images with
# > docker build --tag datascidockerfiles:1.0.0 .
# Run the image interactively with RStudio, open it on http://localhost/
# > docker run -it -p 80:8787 -e PASSWORD=ten --volume $(pwd)/input:/input datascidockerfiles:1.0.0
# Run the workflow:
# > docker run -it --name gwf datascidockerfiles:1.0.0 /work/myscript.sh
# Extract the data:
# > docker cp gwf:/output/ ./outputData
# Extract the figures:
# > docker cp gwf:/work/figures/ ./figures
\end{lstlisting}
\normalsize
\hypertarget{rule-1-use-available-tools}{%
\section*{Rule 1: Use available
tools}\label{rule-1-use-available-tools}}
\addcontentsline{toc}{section}{Rule 1: Use available tools}
\ztitlerefsetup{title=1} \zlabel{rule:tools} \label{rule:tools} \zrefused{rule:tools}
Rule 1 could informally be described as ``Don't bother to write a
Dockerfile!''. Writing a \texttt{Dockerfile} from scratch can be
difficult, and even experts sometimes take shortcuts. A good initial
strategy is to look at tools that can help generate a
\texttt{Dockerfile} for you. The developers of such tools have likely
thought about and implemented good practices, and a generated
\texttt{Dockerfile} may even incorporate newer practices when the tool
is reapplied at a later point in time.
Therefore, the most important rule is to apply a multi-step process to
creating a \texttt{Dockerfile} for your specific use case.
First, you want to determine whether there is an existing image that you
can use; if so, you want to be able to use it and add the instructions
for doing so to your workflow documentation. As an example, you might be
doing some kind of interactive development. For interactive development
environments such as notebooks and development servers or databases, you
can readily find images that come installed with all the software that
you need. You can look for information about images in (a) the
documentation of the software you intend to use; (b) the Docker image
registry \href{https://hub.docker.com/}{Docker Hub}; or (c) the source
code projects of the software being used, as many developers today rely
on containers for development, testing, and teaching.
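Besides browsing the registry's website, Docker Hub can also be
searched from the command line; the \texttt{-\/-filter} flag shown
below narrows the results to official images (the exact output depends
on the registry's current contents):

\footnotesize
\begin{lstlisting}[language=bash,breaklines=true]
# list official images matching "python"
docker search --filter is-official=true python
\end{lstlisting}
\normalsize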
Second, if there is no suitable pre-existing image for your needs, you
might next look to well-maintained tools to help with
\texttt{Dockerfile} generation. These tools can add required software
packages to an existing image without you having to manually write a
\texttt{Dockerfile} at all. ``Well-maintained'' not only refers to the
tool's own stability and usability but also indicates that suitable base
images are used, typically from the official Docker library {[}30{]}, to
ensure that the container has the most recent security fixes for the
operating system in question. See the next section ``Tools for container
generation'' for details.
Third, if these tools do not meet your needs, you may want to write your
own \texttt{Dockerfile}. In this case, follow the remaining rules.
\hypertarget{tools-for-container-generation}{%
\subsection{Tools for container
generation}\label{tools-for-container-generation}}
\texttt{repo2docker} {[}25{]} is a tool maintained by
\href{https://jupyter.org/}{Project Jupyter} that can help to transform
a source code or data repository, e.g., GitHub, GitLab, or Zenodo, into
a container. The tool relies on common configuration files for defining
software dependencies and versions, and it supports a few more special
files; see the
\href{https://repo2docker.readthedocs.io/en/latest/config_files.html}{supported
configuration files}. As an example, we might install
\texttt{jupyter-repo2docker} and then run it against a repository with a
\texttt{requirements.txt} file, an indication of being a Python workflow
with dependencies on the \href{https://pypi.org/}{Python Package Index}
(PyPI), using the following command:
\footnotesize
\begin{Shaded}
\begin{Highlighting}[]
\ExtensionTok{jupyter-repo2docker}\NormalTok{ https://github.com/norvig/pytudes}
\end{Highlighting}
\end{Shaded}
\normalsize
The resulting container image installs the dependencies listed in the
requirements file, and it provides an entrypoint to run a notebook
server to interact with any existing workflows in the repository. Since
\texttt{repo2docker} is used within
\href{https://mybinder.org/}{MyBinder.org}, if you make sure your
workflow is ``Binder-ready,'' you and others can also obtain an online
workspace with a single click. However, one precaution to consider is
that the default command above will create a home for the current user,
meaning that the container itself would not be ideal to share; instead,
any researcher interested in interacting with the code inside should run
\texttt{repo2docker} themselves and create their own container. Because
\texttt{repo2docker} is deterministic, the environments are the same
(see~\hyperref[{rule:pinning}]{Rule~\ztitleref{rule:pinning}} for
ensuring the same software versions).
Additional tools to assist with writing \texttt{Dockerfile}s include
\texttt{containerit} {[}31{]} and \texttt{dockta} {[}32{]}.
\texttt{containerit} automates the generation of a standalone
\texttt{Dockerfile} for workflows in R. This utility can provide a
starting point for users unfamiliar with writing a \texttt{Dockerfile},
or it can, together with other R packages, provide a full image creation
and execution process without having to leave an R session.
\texttt{dockta} supports multiple programming languages and
configuration files, just as \texttt{repo2docker} does, but it attempts
to create a readable \texttt{Dockerfile} compatible with plain Docker
and to improve user experience by cleverly adjusting instructions to
reduce build time. While perhaps more useful for fine-tuning, linters
can also be helpful when writing Dockerfiles, by catching errors or
non-recommended formulations (see
\hyperref[{rule:usage}]{Rule~\ztitleref{rule:usage}}).
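For instance, the \texttt{hadolint} linter can be run via its own
container image, so no local installation is needed (this assumes the
\texttt{hadolint/hadolint} image name on Docker Hub is still current):

\footnotesize
\begin{lstlisting}[language=bash,breaklines=true]
# lint the Dockerfile in the current directory
docker run --rm --interactive hadolint/hadolint < Dockerfile
\end{lstlisting}
\normalsize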
\hypertarget{tools-for-templating}{%
\subsection{Tools for templating}\label{tools-for-templating}}
It is likely that over time you will work on projects and develop images
that are similar in nature to each other. To avoid constantly repeating
yourself, you should consider adopting a standard workflow that will
give you a quick start for a new project. As an example, cookie cutter
templates {[}33{]} or community templates (e.g., {[}34{]}) can provide
the required structure and files (e.g., for documentation, continuous
integration (CI), and licenses), for getting started. If you decide to
build your own cookie cutter template, consider collaborating with your
community during development of the standard to ensure it will be useful
to others.
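As a sketch, instantiating such a template with the
\texttt{cookiecutter} command-line tool might look as follows; the
template repository name is a placeholder, not a real project:

\footnotesize
\begin{lstlisting}[language=bash,breaklines=true]
pip install cookiecutter
# "gh:" is shorthand for a GitHub-hosted template
cookiecutter gh:your-community/docker-workflow-template
\end{lstlisting}
\normalsize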
Part of your project template should be a protocol for publishing the
\texttt{Dockerfile} and even exporting the image to a suitable location,
e.g., a container registry or data repository, taking into consideration
how your workflow can receive a DOI for citation. A template is
preferable to your own set of base images because of the maintenance
efforts the base images require. Therefore, instead of building your own
independent solution, consider contributing to existing suites of images
(see \hyperref[{rule:base}]{Rule~\ztitleref{rule:base}}) and improving
these for your needs.
For any tool that you use, be sure to look at documentation for usage
and configuration options, and look for options to add metadata (e.g.,
labels see~\hyperref[{rule:document}]{Rule~\ztitleref{rule:document}}).
\hypertarget{rule-2-build-upon-existing-images}{%
\section*{Rule 2: Build upon existing
images}\label{rule-2-build-upon-existing-images}}
\addcontentsline{toc}{section}{Rule 2: Build upon existing images}
\ztitlerefsetup{title=2} \zlabel{rule:base} \label{rule:base} \zrefused{rule:base}
Many pre-built community and developer contributed Docker images are
publicly available for anyone to pull, run, and extend, without having
to replicate the image construction process. However, a good
understanding of how \emph{base images} and \emph{image tags} work is
crucial, as the image and tag that you choose has important implications
for your derived images and containers. It is good practice to use
\textbf{base images} that are maintained by the Docker library, so
called \emph{``official images''} {[}35{]}, which benefit from a review
for best practices and vulnerability scanning {[}13{]}. You can identify
these images by the missing user portion of the image name, which comes
before the \texttt{/}, e.g., \texttt{r-base} or \texttt{python}.
However, these images only provide basic programming languages or very
widely used software, so you will likely use images maintained by
organisations or fellow researchers.
While some organisations can be trusted to update images with security
fixes (see list below), for most individual accounts that provide
ready-to-use images, it is likely that these will not be updated
regularly. Further, it's even possible that an image or a
\texttt{Dockerfile} could disappear, or an image could be published with
malicious intent (though we have not heard of any such case in
academia). Therefore, for security, transparency, and reproducibility,
you should only use images where you have access to the
\texttt{Dockerfile}. In case a repository goes away, we suggest that you
save a copy of the \texttt{Dockerfile} within your project (see
\hyperref[{rule:mount}]{Rule~\ztitleref{rule:mount}}).
The following list is a selection of communities that produce widely
used, regularly updated images, including ready-to-use images with
preinstalled collections of software configured to work out of the box.
Do take advantage of such images, especially for complex software
environments, e.g., machine learning tool stacks, or a specific
\href{https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms}{BLAS}
library.
\begin{itemize}
\tightlist
\item
\href{https://www.rocker-project.org/}{Rocker} for R and RStudio
images {[}20{]}
\item
\href{https://bioconductor.org/help/docker/}{Bioconductor Docker
images} for bioinformatics with R
\item
\href{https://hub.docker.com/_/neurodebian}{NeuroDebian images} for
neuroscience {[}36{]}
\item
\href{https://jupyter-docker-stacks.readthedocs.io/en/latest/index.html}{Jupyter
Docker Stacks} for Notebook-based computing
\item
\href{https://hub.docker.com/r/taverna/taverna-server}{Taverna Server}
for running Taverna workflows
\end{itemize}
For example, here is how we would use a base image \texttt{verse}, which
provides the popular Tidyverse suite of packages {[}37{]}, with R
version \texttt{3.6.2} from the \texttt{rocker} organisation on Docker
Hub (\texttt{docker.io}, which is the default and can be omitted).
\footnotesize
\begin{Shaded}
\begin{Highlighting}[]
\KeywordTok{FROM}\NormalTok{ docker.io/rocker/verse:3.6.2}
\end{Highlighting}
\end{Shaded}
\normalsize
\hypertarget{use-version-specific-tags}{%
\subsection{Use version-specific tags}\label{use-version-specific-tags}}
Images have \textbf{tags} associated with them, and these tags have
specific meanings, e.g., a semantic version indicator such as
\texttt{3.7} or \texttt{dev}, or variants like \texttt{slim} that
attempt to reduce image size. Tags are defined at the time of image
build and appear in the image name after the \texttt{:} when you use an
image, e.g., \texttt{python:3.7}. By \emph{convention} a missing tag is
assumed to be the word \texttt{latest}, which gives you the latest
updates but is also a moving target for your computing environment that
can break your workflow. Note that a version tag means that the tagged
software is frozen, but it does not mean that the image will not change,
as backwards compatible fixes (cf.~semantic versioning, {[}38{]}), e.g.,
version \texttt{1.2.3} that fixes a security problem in version
\texttt{1.2.2} or updates to an underlying system library, would be
published to the parent tag \texttt{1.2}.
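The difference is visible directly in the \texttt{FROM} instruction;
the two lines below are alternatives, and the version numbers are
illustrative:

\footnotesize
\begin{lstlisting}[language=docker,breaklines=true]
# no tag implies "latest" - a moving target, avoid this:
FROM python
# a version-specific tag gives a stable environment:
FROM python:3.7.6
\end{lstlisting}
\normalsize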
For data science workflows, you should always rely on version-specific
image tags, both for base images that you use, and for images that you
build yourself and then run (see usage instructions in
Listing~\ref{lst:full} for an example of the \texttt{-\/-tag} parameter
of \texttt{docker\ build}). When keeping different versions (tags)
available, it is good practice to publish an image in an image registry.
For details, we refer you to the documentation on automated builds, see
\href{https://docs.docker.com/docker-hub/builds/}{Docker Hub Builds} or
\href{https://docs.gitlab.com/ee/user/packages/container_registry/index.html\#build-and-push-images}{GitLab's
Container Registry} as well as CI services such as
\href{https://github.com/actions/starter-workflows/tree/master/ci}{GitHub
actions}, or
\href{https://circleci.com/orbs/registry/orb/circleci/docker\#commands-build}{CircleCI}
that can help you get started. Do not \texttt{docker\ push} a locally
built image, because that counteracts the considerations outlined above.
If a pre-built image is provided in a public image registry, do not
forget to direct the user to it in your documentation, e.g., in the
\texttt{README} file or in an article.
\hypertarget{rule-3-format-for-clarity}{%
\section*{Rule 3: Format for clarity}\label{rule-3-format-for-clarity}}
\addcontentsline{toc}{section}{Rule 3: Format for clarity}
\ztitlerefsetup{title=3} \zlabel{rule:formatting} \label{rule:formatting} \zrefused{rule:formatting}
\ztitlerefsetup{title=3} \zlabel{rule:clarity} \label{rule:clarity} \zrefused{rule:clarity}
First, it is good practice to think of the \texttt{Dockerfile} as a
human- \textbf{and} machine-readable file. This means that you should
use indentation, new lines, and comments to make your
\texttt{Dockerfile} well documented and readable. Specifically,
carefully indent commands and their arguments to make clear what belongs
together, especially when connecting multiple commands in a \texttt{RUN}
instruction with \texttt{\&\&}. Use \texttt{\textbackslash{}} at the end
of a line to break a single command into multiple lines. This will
ensure that no single line gets too long to read comfortably. Content
spread across more, shorter lines also improves the readability of
changes in version control systems. Further, use the long versions of
parameters for readability (e.g., \texttt{-\/-input} instead of
\texttt{-i}). When you need to change a directory, use \texttt{WORKDIR},
because it not only creates the directory if it does not exist but also
persists the change across multiple \texttt{RUN} instructions.
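A minimal sketch of these formatting practices follows; the dataset
URL is a placeholder, and the snippet assumes \texttt{wget} and
\texttt{unzip} are installed in the base image:

\footnotesize
\begin{lstlisting}[language=docker,breaklines=true]
# WORKDIR creates /data and persists across RUN instructions
WORKDIR /data
# long options and indentation make the steps easy to follow
RUN wget --continue --output-document dataset.zip \
        "https://example.org/dataset.zip" && \
    unzip dataset.zip && \
    rm dataset.zip
\end{lstlisting}
\normalsize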
Second, clarity of the steps within a Dockerfile is most important, and
if it requires verbosity or adds to the final image size, that is an
acceptable tradeoff. For example, if your container uses a script to run
a complex install routine, instead of removing it from the container
upon completion (a practice commonly seen in production
\texttt{Dockerfile}s aiming at small image sizes, cf.~{[}12{]}), you
should keep the script in the container for a future user to inspect;
the script size is negligible compared to the image size. One common
pattern you will encounter is a single and very lengthy \texttt{RUN}
instruction chaining multiple commands to install software and clean up
afterwards. For example, (a) the instruction updates the database of
available packages, installs a piece of software from a package
repository, and purges the cache of the package manager, or (b) the
instruction downloads a software's source archive, unpacks it, builds
and installs the software, and then removes the downloaded archive and
all temporary files. Although this pattern creates instructions that may
be hard to read, it is very common and can even increase clarity within
the image file system because installation and build artifacts are gone.
In general, if your container is mostly software dependencies, you
should not need to worry about image size because (a) your data is
likely to have much larger storage requirements, and (b) transparency
and inspectability outweigh storage concerns in data science. If you
really need to reduce the size, you may look into using multiple
containers (cf.~{[}12{]}) or multi-stage builds {[}39{]}.
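If image size does matter, a multi-stage build keeps compilers and
other build-time tooling out of the final image. The following sketch
assumes a project whose \texttt{Makefile} supports the \texttt{DESTDIR}
convention; image tags are illustrative:

\footnotesize
\begin{lstlisting}[language=docker,breaklines=true]
# stage 1: build the tool with compilers installed
FROM debian:buster AS build
RUN apt-get update && \
    apt-get install -y build-essential && \
    rm -rf /var/lib/apt/lists/*
COPY . /src
RUN cd /src && make && make install DESTDIR=/opt/tool
# stage 2: copy only the installed files into a slim image
FROM debian:buster-slim
COPY --from=build /opt/tool /usr/local
\end{lstlisting}
\normalsize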
Depending on the programming language used, your project may already
contain files to manage dependencies, and you may use a package manager
to control this aspect of the computing environment. This is a very good
practice and helpful, though you should consider the externalisation of
content to outside of the \texttt{Dockerfile} (see
\hyperref[{rule:mount}]{Rule~\ztitleref{rule:mount}}). Often, a single
long \texttt{Dockerfile} with sections and helpful comments can be more
understandable than a collection of separate files.
Generally, aim to design the \texttt{RUN} instructions so that each
performs one scoped action, e.g., download, compile, and install one
tool. This makes the lines of your \texttt{Dockerfile} a well-documented
recipe for the user as well as a machine. Each instruction will result
in a new layer, and reasonably grouped changes increase readability of
the \texttt{Dockerfile} and facilitate inspection of the image, e.g.,
with tools like dive {[}40{]}. Convoluted \texttt{RUN} instructions can
be acceptable to reduce the number of layers, but careful layout and
consistent formatting should be applied.
Although you will find \texttt{Dockerfile}s that use
\href{https://docs.docker.com/engine/reference/commandline/build/\#set-build-time-variables---build-arg}{build-time
variables} to dynamically change parameters at build time, such a
customisation option reduces clarity for data science workflows.
\hypertarget{rule-4-document-within-the-dockerfile}{%
\section*{Rule 4: Document within the
Dockerfile}\label{rule-4-document-within-the-dockerfile}}
\addcontentsline{toc}{section}{Rule 4: Document within the Dockerfile}
\ztitlerefsetup{title=4} \zlabel{rule:document} \label{rule:document} \zrefused{rule:document}
\hypertarget{explain-in-comments}{%
\subsection{Explain in comments}\label{explain-in-comments}}
As you are writing the \texttt{Dockerfile}, be mindful of how other
people (including future you!) will read it and why. Are your choices
and commands being executed clearly, or are further comments warranted?
To assist others in making sense of your \texttt{Dockerfile}, you can
add comments that include links to online forums, code repository
issues, or version control commit messages to give context for your
specific decisions. For example
\href{https://github.com/Kaggle/docker-rstats/blob/master/Dockerfile}{this
\texttt{Dockerfile} by Kaggle} does a good job of explaining the
reasoning behind the contained instructions. If you copy instructions
from another \texttt{Dockerfile}, acknowledge the source in a comment.
Also, it can be helpful to include comments about commands that did not
work so you do not repeat past mistakes. Further, if you find that you
need to remember an undocumented step, that is an indication this step
should be documented in the \texttt{Dockerfile}. All instructions can be
grouped starting with a short comment, which also makes it easier to
spot changes if your \texttt{Dockerfile} is managed in some version
control system (see
\hyperref[{rule:publish}]{Rule~\ztitleref{rule:publish}}).
Listing~\ref{lst:comments} shows a selection of typical kinds of
comments that are useful to include in a \texttt{Dockerfile}.
\scriptsize
\begin{minipage}{\linewidth}
\begin{lstlisting}[language=docker,caption={Partial \texttt{Dockerfile} with examples for helpful comments.},breaklines=true,label={lst:comments}]
...
# apt-get install specific version, use 'apt-cache madison <pkg>'
# to see available versions
RUN apt-get install -y python3-pandas=0.23.3+dfsg-4ubuntu1
# install required R packages; before log the used repository
# for better provenance in the build log
RUN R -e 'getOption("repos")' && \
install2.r \
fortunes \
here
# this library must be installed from source to get version newer
# than in apt sources
RUN git clone http://url.of/repo && \
cd repo && \
make build && \
make install
\end{lstlisting}
\end{minipage}
\normalsize
\hypertarget{add-metadata-as-labels}{%
\subsection{Add metadata as labels}\label{add-metadata-as-labels}}
Docker automatically captures useful information in the image metadata,
such as the version of Docker used for building the image. The
\href{https://docs.docker.com/engine/reference/builder/\#label}{\texttt{LABEL}
instruction} can add \textbf{custom metadata} to images. You can view
all labels and other image metadata with the
\href{https://docs.docker.com/engine/reference/commandline/inspect/}{\texttt{docker\ inspect}}
command. Listing~\ref{lst:labels} shows the labels most relevant for
data science workflows. Labels serve as structured metadata that can be
leveraged by services, e.g., \url{https://microbadger.com/labels}. For
example, software versions of containerised applications (cf.~{[}12{]}),
licenses, and maintainer contact information are commonly seen, and they
are very useful if a \texttt{Dockerfile} is discovered out of context.
Regarding licensing information, this should include the license of your
own code and could point to a \texttt{LICENSE} file within the image
(cf.~{[}12{]}). While you can add arbitrarily complex information with
labels, for data science scenarios the user-facing documentation is much
more important. Relevant metadata that might be utilised with future
tools include global identifiers such as \href{https://orcid.org/}{ORCID
identifiers}, DOIs of the research compendium
(cf.~\url{https://research-compendium.science}), e.g.,
\href{https://help.zenodo.org/}{reserved on Zenodo}, or a funding
agency's grant number. You can use the
\href{https://docs.docker.com/engine/reference/builder/\#arg}{\texttt{ARG}
instruction} to pass variables at build time, for example to pass values
into labels, such as the current date or version control revision.
However, a script or \texttt{Makefile} might be required so that you do
not forget that you set the argument, or how you set it (see
\hyperref[{rule:usage}]{Rule~\ztitleref{rule:usage}}).
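As a sketch, the following passes the version control revision into a
label at build time; the label key follows the Open Container
Initiative annotation convention, and the image tag is illustrative:

\footnotesize
\begin{lstlisting}[language=docker,breaklines=true]
# default value used if --build-arg is not given
ARG VCS_REF=unknown
LABEL org.opencontainers.image.revision=$VCS_REF
\end{lstlisting}
\normalsize

The corresponding build command would then be
\texttt{docker build -\/-build-arg VCS\_REF="\$(git rev-parse -\/-short HEAD)" -\/-tag myimage:1.0.0 .},
which is exactly the kind of invocation worth capturing in a script or
\texttt{Makefile}.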