Commit 11dff5e

feat: add latex serializer
Signed-off-by: Peter Staar <[email protected]>
1 parent e54524d commit 11dff5e

9 files changed: +1870 -0 lines changed

docling_core/transforms/serializer/latex.py

Lines changed: 744 additions & 0 deletions
Large diffs are not rendered by default.

test/data/doc/2408.09869v3_enriched.gt.tex

Lines changed: 465 additions & 0 deletions
Large diffs are not rendered by default.

test/data/doc/activities.gt.tex

Lines changed: 60 additions & 0 deletions
@@ -0,0 +1,60 @@
1+
\documentclass[11pt,a4paper]{article}
2+
3+
\usepackage[utf8]{inputenc} % allow utf-8 input
4+
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
5+
\usepackage{hyperref} % hyperlinks
6+
\usepackage{url} % simple URL typesetting
7+
\usepackage{booktabs} % professional-quality tables
8+
\usepackage{amsfonts} % blackboard math symbols
9+
\usepackage{nicefrac} % compact symbols for 1/2, etc.
10+
\usepackage{microtype} % microtypography
11+
\usepackage{xcolor} % colors
12+
\usepackage{graphicx} % graphics
13+
14+
\begin{document}
15+
16+
\section{Summer activities}
17+
18+
\section{Swimming in the lake}
19+
20+
Duck
21+
22+
\begin{figure}[h]
23+
% image
24+
\caption{Figure 1: This is a cute duckling}
25+
\end{figure}
26+
27+
\section{Let's swim!}
28+
29+
To get started with swimming, first lay down in a water and try not to drown:
30+
31+
\begin{itemize}
32+
\item ∞ You can relax and look around
33+
\item ∞ Paddle about
34+
\item ∞ Enjoy summer warmth
35+
\end{itemize}
36+
37+
Also, don't forget:
38+
39+
\begin{itemize}
40+
\item 1. Wear sunglasses
41+
\item 2. Don't forget to drink water
42+
\item 3. Use sun cream
43+
\end{itemize}
44+
45+
Hmm, what else…
46+
47+
\begin{itemize}
48+
\item -Another activity item
49+
\item -Yet another one
50+
\item -Stopping it here
51+
\end{itemize}
52+
53+
Some text.
54+
55+
\begin{itemize}
56+
\item -Starting the next page with a list item.
57+
\item -Second item.
58+
\end{itemize}
59+
60+
\end{document}

test/data/doc/construct_doc.gt.tex

Lines changed: 127 additions & 0 deletions
@@ -0,0 +1,127 @@
1+
\documentclass[11pt,a4paper]{article}
2+
3+
\usepackage[utf8]{inputenc} % allow utf-8 input
4+
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
5+
\usepackage{hyperref} % hyperlinks
6+
\usepackage{url} % simple URL typesetting
7+
\usepackage{booktabs} % professional-quality tables
8+
\usepackage{amsfonts} % blackboard math symbols
9+
\usepackage{nicefrac} % compact symbols for 1/2, etc.
10+
\usepackage{microtype} % microtypography
11+
\usepackage{xcolor} % colors
12+
\usepackage{graphicx} % graphics
13+
\title{Title of the Document}
14+
15+
\begin{document}
16+
17+
\maketitle
18+
19+
\begin{itemize}
20+
\item item of leading list
21+
\end{itemize}
22+
23+
Author 1
24+
Affiliation 1
25+
26+
Author 2
27+
Affiliation 2
28+
29+
\section{1. Introduction}
30+
31+
This paper introduces the biggest invention ever made. ...
32+
33+
\begin{itemize}
34+
\item list item 1
35+
\item list item 2
36+
\item list item 3
37+
\begin{enumerate}
38+
\item list item 3.a
39+
\item list item 3.b
40+
\item list item 3.c
41+
\begin{enumerate}
42+
\item list item 3.c.i
43+
\end{enumerate}
44+
\end{enumerate}
45+
\item list item 4
46+
\end{itemize}
47+
48+
\begin{table}[h]
49+
\caption{This is the caption of table 1.}
50+
\begin{tabular}{|l|l|l|}
51+
\hline
52+
Product & Years & Years \\ \hline
53+
Product & 2016 & 2017 \\ \hline
54+
Apple & 49823 & 695944 \\ \hline
55+
\end{tabular}
56+
\end{table}
57+
58+
\begin{figure}[h]
59+
% image
60+
\caption{This is the caption of figure 1.}
61+
\end{figure}
62+
63+
\begin{figure}[h]
64+
% image
65+
\caption{This is the caption of figure 2.}
66+
\end{figure}
67+
68+
\begin{itemize}
69+
\item item 1 of list
70+
\end{itemize}
71+
72+
\begin{itemize}
73+
\item item 1 of list after empty list
74+
\item item 2 of list after empty list
75+
\end{itemize}
76+
77+
\begin{itemize}
78+
\item item 1 of neighboring list
79+
\item item 2 of neighboring list
80+
\begin{itemize}
81+
\item item 1 of sub list
82+
\item Here a code snippet: \texttt{print("Hello world")} (to be displayed inline)
83+
Here a code snippet: \texttt{print("Hello world")} (to be displayed inline)
84+
\item Here a formula: $E=mc^2$ (to be displayed inline)
85+
Here a formula: $E=mc^2$ (to be displayed inline)
86+
\end{itemize}
87+
\end{itemize}
88+
89+
Here a code block:
90+
91+
\begin{verbatim}
92+
print("Hello world")
93+
\end{verbatim}
94+
95+
Here a formula block:
96+
97+
$$E=mc^2$$
98+
99+
% missing-key-value-item
100+
101+
% missing-form-item
102+
103+
Some formatting chops: \textbf{bold} \textit{italic} \underline{underline} \sout{strikethrough} $_{subscript}$ $^{superscript}$ \href{.}{hyperlink} \& \href{https://github.com/DS4SD/docling}{\sout{\underline{\textit{\textbf{everything at the same time.}}}}}
104+
105+
\begin{enumerate}
106+
\item Item 1 in A
107+
\item Item 2 in A
108+
\item Item 3 in A
109+
\begin{enumerate}
110+
\item Item 1 in B
111+
\item Item 2 in B
112+
\begin{enumerate}
113+
\item Item 1 in C
114+
\item Item 2 in C
115+
\end{enumerate}
116+
\item Item 3 in B
117+
\end{enumerate}
118+
\item Item 4 in A
119+
\end{enumerate}
120+
121+
\begin{itemize}
122+
\item List item without parent list group
123+
\end{itemize}
124+
125+
The end.
126+
127+
\end{document}
Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
1+
\documentclass[11pt,a4paper]{article}
2+
3+
\usepackage[utf8]{inputenc} % allow utf-8 input
4+
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
5+
\usepackage{hyperref} % hyperlinks
6+
\usepackage{url} % simple URL typesetting
7+
\usepackage{booktabs} % professional-quality tables
8+
\usepackage{amsfonts} % blackboard math symbols
9+
\usepackage{nicefrac} % compact symbols for 1/2, etc.
10+
\usepackage{microtype} % microtypography
11+
\usepackage{xcolor} % colors
12+
\usepackage{graphicx} % graphics
13+
\title{Rich tables}
14+
15+
\begin{document}
16+
17+
\maketitle
18+
19+
\begin{table}[h]
20+
\begin{tabular}{|l|l|}
21+
\hline
22+
cell 0,0 & cell 0,1 \\ \hline
23+
cell 1,0 & \textit{text in italic} \\ \hline
24+
\begin{itemize} \item list item 1 \item list item 2 \end{itemize} & cell 2,1 \\ \hline
25+
cell 3,0 & \begin{table}[h] \begin{tabular}{|l|l|l|} \hline inner cell 0,0 & inner cell 0,1 & inner cell 0,2 \\ \hline inner cell 1,0 & inner cell 1,1 & inner cell 1,2 \\ \hline \end{tabular} \end{table} \\ \hline
26+
Some text in a generic group. More text in the group. & cell 4,1 \\ \hline
27+
\end{tabular}
28+
\end{table}
29+
30+
\end{document}

test/data/doc/dummy_doc.gt.tex

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
\documentclass[11pt,a4paper]{article}
2+
3+
\usepackage[utf8]{inputenc} % allow utf-8 input
4+
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
5+
\usepackage{hyperref} % hyperlinks
6+
\usepackage{url} % simple URL typesetting
7+
\usepackage{booktabs} % professional-quality tables
8+
\usepackage{amsfonts} % blackboard math symbols
9+
\usepackage{nicefrac} % compact symbols for 1/2, etc.
10+
\usepackage{microtype} % microtypography
11+
\usepackage{xcolor} % colors
12+
\usepackage{graphicx} % graphics
13+
\title{DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis}
14+
15+
\begin{document}
16+
17+
\maketitle
18+
19+
\begin{figure}[h]
20+
% image
21+
\caption{Figure 1: Four examples of complex page layouts across different document categories}
22+
% annotation[classification]: bar chart
23+
% annotation[description]: ...
24+
% annotation[molecule_data]: CC1=NNC(C2=CN3C=CN=C3C(CC3=CC(F)=CC(F)=C3)=N2)=N1
25+
\end{figure}
26+
27+
% annotation[description]: A description annotation for this table.
28+
29+
\end{document}
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
1+
\documentclass[11pt,a4paper]{article}
2+
3+
\usepackage[utf8]{inputenc} % allow utf-8 input
4+
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
5+
\usepackage{hyperref} % hyperlinks
6+
\usepackage{url} % simple URL typesetting
7+
\usepackage{booktabs} % professional-quality tables
8+
\usepackage{amsfonts} % blackboard math symbols
9+
\usepackage{nicefrac} % compact symbols for 1/2, etc.
10+
\usepackage{microtype} % microtypography
11+
\usepackage{xcolor} % colors
12+
\usepackage{graphicx} % graphics
13+
\title{Contribution guideline example}
14+
15+
\begin{document}
16+
17+
\maketitle
18+
19+
This is simple.
20+
21+
Foo \textit{emphasis} \textbf{strong emphasis} \textit{\textbf{both}} .
22+
23+
Create your feature branch: \texttt{git checkout -b feature/AmazingFeature} .
24+
25+
\begin{enumerate}
26+
\item Pull the \href{https://github.com/docling-project/docling}{\textbf{repository}} .
27+
Pull the \href{https://github.com/docling-project/docling}{\textbf{repository}} .
28+
\item Create your feature branch ( \texttt{git checkout -b feature/AmazingFeature} )
29+
Create your feature branch ( \texttt{git checkout -b feature/AmazingFeature} )
30+
\item Commit your changes ( \texttt{git commit -m 'Add some AmazingFeature'} )
31+
Commit your changes ( \texttt{git commit -m 'Add some AmazingFeature'} )
32+
\item Push to the branch ( \texttt{git push origin feature/AmazingFeature} )
33+
Push to the branch ( \texttt{git push origin feature/AmazingFeature} )
34+
\item Open a Pull Request
35+
\item \textbf{Whole list item has same formatting}
36+
\item List item has \textit{mixed or partial} formatting
37+
List item has \textit{mixed or partial} formatting
38+
\end{enumerate}
39+
40+
\title{\textit{Whole heading is italic}}
41+
42+
Some \texttt{formatted_code}
43+
44+
\section{\textit{Partially formatted} heading to\_escape \texttt{not_to_escape} $E=mc^2$ \& ampersand}
45+
46+
\textit{Partially formatted} heading to\_escape \texttt{not_to_escape} $E=mc^2$ \& ampersand
47+
48+
A hyperlink on \texttt{code in a line}
49+
50+
\begin{verbatim}
51+
A hyperlink on code as paragraph
52+
\end{verbatim}
53+
54+
The end.
55+
56+
\end{document}
