An R implementation of LMDX (Perot et
al. 2023).
You provide a PDF page (or a PDF file) and a decoding schema (JSON),
and you get back all the entities extracted from the PDF.
You can install the development version of LMDX and its prerequisites from GitHub with:
# install.packages("pak")
pak::pak("mlverse/chattr")
pak::pak("cregouby/LMDX")
Here we want to extract the content of the R short reference card PDF file and turn it into a data.frame:
(Figure: R reference card, page 1 screenshot)
This is a challenge, as the page is laid out in 3 tight columns that mix code with highly condensed sentences.
The taxonomy is a JSON representation of the entities to extract from the document. Depending on the capacity of the LLM, the taxonomy can be hierarchical, as in the following example:
Here we can see that the document is structured in paragraphs such as
Getting help, then Input and output, and so on. This is the first
layer of the hierarchy, and each paragraph has a title and a
description.
Then, within each paragraph, there are multiple blocks, each made of an R
command, a description and possibly an example.
The corresponding taxonomy looks like this:
taxonomy <- jsonlite::minify('{
  "title": "",
  "paragraph_item": [
    {
      "title": "",
      "description": [],
      "line_item": [
        {
          "command": "",
          "description": "",
          "example": []
        }
      ]
    }
  ]
}')
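As a quick sanity check, the taxonomy string can be validated and parsed back into an R list to inspect its hierarchy. This is a minimal sketch that relies only on {jsonlite}; the taxonomy is repeated so that the snippet runs on its own.

```r
library(jsonlite)

# Same taxonomy as above, repeated so this snippet is self-contained
taxonomy <- minify('{
  "title": "",
  "paragraph_item": [
    {
      "title": "",
      "description": [],
      "line_item": [
        {"command": "", "description": "", "example": []}
      ]
    }
  ]
}')

# validate() tells whether the string is syntactically valid JSON
is_valid <- validate(taxonomy)

# Parse without simplification to keep the nested list structure
taxonomy_list <- fromJSON(taxonomy, simplifyVector = FALSE)

# First layer of the hierarchy
top_level <- names(taxonomy_list)

# Fields expected for each block inside a paragraph
line_fields <- names(taxonomy_list$paragraph_item[[1]]$line_item[[1]])
```

Inspecting top_level and line_fields is a cheap way to catch a misplaced bracket before the taxonomy is sent to a model.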
The prompt is assembled from the document text, including its layout information, and the taxonomy.
library(LMDX)
document <- system.file("extdata", "Short-refcard_1.pdf", package = "LMDX")
prompt <- lmdx_prompt(document, taxonomy, segment = "line")
Let’s have a look at the resulting prompt:
prompt[[1]] |> stringr::str_trunc(500)
#> <Document>
#> R Reference Card 132|63
#> by Tom Short, EPRI PEAC, [email protected] 2004-11-07 88|87
#> Granted to the public domain. See www.Rpad.org for the source and latest 141|97
#> version. Includes material from R for Beginners by Emmanuel Paradis (with 141|106
#> permission). 37|116
#> Getting help 53|153
#> Most R functions have online documentation. 73|165
#> help(topic) documentation on topic 101|174
#> ?topic id. 42|184
#> help.search("topic") search the help system 132|193
#> apropos("topic") the names of all...
prompt[[1]] |> stringr::str_trunc(500, side = "left")
#> ...s 661|540
#> = n!/[(n − k)!k!] 576|549
#> na.omit(x) suppresses the observations with missing data (NA) (sup- 672|559
#> presses the corresponding line if x is a matrix or a data frame) 663|569
#> na.fail(x) returns an error message if x contains at least one NA 658|578
#> </Document><Task>
#> From the document, extract the text values and tags of the following entities:
#> {"title":"","paragraph_item":[{"title":"","description":[],"line_item":[{"command":"","description":"","example":[]}]}]}
#> </Task>
#> <Extraction>
prompt
is a list of textual prompts, conforming to the original paper, that
we want the LLM to process.
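To make the prompt structure concrete, here is a minimal base-R sketch of how text segments and their coordinates could be assembled into the layout shown above. The build_prompt() helper and the lines data.frame are hypothetical and for illustration only; they are not part of the LMDX package, where lmdx_prompt() does the real work, including text extraction from the PDF.

```r
# Hypothetical input: one row per text line, with integer page coordinates
lines <- data.frame(
  text = c("R Reference Card", "Getting help"),
  x = c(132L, 53L),
  y = c(63L, 153L)
)

# Shortened taxonomy, for illustration only
taxonomy <- '{"title":"","paragraph_item":[]}'

# Hypothetical helper (not part of LMDX): mimic the prompt layout shown above
build_prompt <- function(lines, taxonomy) {
  # Each segment is the text followed by its "x|y" coordinate tag
  segments <- paste0(lines$text, " ", lines$x, "|", lines$y)
  paste0(
    "<Document>\n",
    paste(segments, collapse = "\n"),
    "\n</Document><Task>\n",
    "From the document, extract the text values and tags of the following entities:\n",
    taxonomy,
    "\n</Task>\n<Extraction>"
  )
}

sketch <- build_prompt(lines, taxonomy)
cat(sketch)
```

Ending the prompt on the opening Extraction tag leaves the model to complete it with a JSON object that follows the taxonomy.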
The usual way to do this is to call an LLM served online. We use the {chattr} package for that, as it also supports running local models.
We query 16 generations from the model with a temperature of 0.5.
library(chattr)
response <- ch_submit_job(
  prompt = prompt,
  defaults = chattr_defaults(model_arguments = list("temperature" = 0.5))
)
This is not run here. The paper reports good results with the PaLM 2 model, but feel free to choose your own model and report the results!
The last step consists of decoding the output and passing it to a majority-vote engine:
# response
r_reference_card_df <- majority_vote(decode_json_result(response))
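To illustrate the idea behind this step, here is a minimal base-R sketch of majority voting over several sampled generations for a single entity. The vote_value() helper and the sample values are hypothetical and deliberately simplified; this is not the implementation behind majority_vote(), which works on the full decoded structure.

```r
# Hypothetical decoded values for one entity across 5 sampled generations;
# NA stands for a generation that failed to decode
samples <- c("help(topic)", "help(topic)", "help(topics)", NA, "help(topic)")

# Hypothetical helper: keep the most frequent non-missing value
vote_value <- function(values) {
  values <- values[!is.na(values)]
  if (length(values) == 0) return(NA_character_)
  counts <- table(values)
  # which.max() returns the first maximum, so a tie goes to the value
  # that comes first in the sorted names of the table
  names(counts)[which.max(counts)]
}

winner <- vote_value(samples)
print(winner)
```

Sampling several generations at a non-zero temperature and voting makes the final value less sensitive to a single faulty completion.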