How can I get information from checkboxes in tables? #165

Open
2 tasks done
ddotta opened this issue Jul 18, 2024 · 3 comments

ddotta commented Jul 18, 2024

Prework

Question

I'm trying to extract data from a PDF document that contains tables with checkboxes (see my reproducible example below).

The extract_tables() function works well and identifies the tables in the PDF document, but I only get NA for all the checkboxes.
Is there any way of identifying which boxes are checked?
Many thanks for your help! 🙏

Reproducible example

Here's my PDF:
test.pdf

And my code:

library(tabulapdf)

fichier <- "test.pdf"
tableaux <- extract_tables(fichier, output = "tibble")

bases_de_conjoncture <- tableaux[[1]]
sources <- tableaux[[2]]

What I get:

# A tibble: 33 × 3
   `CERISE (Espace de Production des données)`                                 ...2       ...3          
   <chr>                                                                       <chr>      <chr>         
 1 Préciser ci-dessous la liste des sources statistiques (cf. liste sur GEDSI) NA         NA            
 2 Rubrique Source                                                             Producteur Chargé d'étude
 3 000_Referentiels                                                            NA         NA            
 4 0010_Balsa_IAA                                                              NA         NA            
 5 0020_Balsa_EA                                                               NA         NA            
 6 0030_Balsa_v2_EA                                                            NA         NA            
 7 0040_Geo                                                                    NA         NA            
 8 0050_BDNU                                                                   NA         NA            
 9 010_Territoires                                                             NA         NA            
10 1010_Enquete_TERUTI                                                         NA         NA            
11 020_Meteorologie                                                            NA         NA            
12 2010_Conj_meteo                                                             NA         NA            
13 030_Structures_exploitations                                                NA         NA            
14 3010_Enquetes_Structures                                                    NA         NA            
15 3020_Recensements                                                           NA         NA            
16 040_Pratiques_agricoles                                                     NA         NA            
17 4000_Pratiques_Culturales                                                   NA         NA            
18 4010_Pratiques_grandes_cultures                                             NA         NA            
19 4040_Pratiques_arboriculture                                                NA         NA            
20 4050_Pratiques_elevage                                                      NA         NA            
21 4060_Conso_energie_EA                                                       NA         NA            
22 4070_Conso_energie_EDT_CUMA                                                 NA         NA            
23 050_Productions_vegetales                                                   NA         NA            
24 5010_Terres_labourables                                                     NA         NA            
25 5030_Conj_Prairies                                                          NA         NA            
26 5040_Conj_viticole                                                          NA         NA            
27 5050_Conj_fruits                                                            NA         NA            
28 5060_Conj_legumes                                                           NA         NA            
29 060_Productions_viandes_oeufs                                               NA         NA            
30 6010_Enquetes_cheptels                                                      NA         NA            
31 6020_Abattage_gros_animaux                                                  NA         NA            
32 6030_Abattage_volailles_lapins                                              NA         NA            
33 6035_Abattages                                                              NA         NA     

ddotta commented Jul 18, 2024

I managed to do what I wanted with pdftools::pdf_text(), though it took some extra processing.

It would be very useful if this could be implemented directly in extract_tables().

@pachadotdev
Contributor

Hi @ddotta,
thanks for reporting this.
How did you manage to do this?


ddotta commented Jul 19, 2024

@pachadotdev
Here's a solution. It's not very optimized, but it does what I want:
https://gist.github.com/ddotta/8e828145355bb87e78d83191b747b2e0
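
For readers who cannot open the gist, here is a rough sketch of the general idea of reading checkbox states from the text layer. It is not the code from the gist. It assumes the PDF draws its checkboxes as Unicode glyphs (☒ or ☑ for checked, ☐ for unchecked) and that each table row holds exactly two boxes, in the order Producteur then Chargé d'étude. The helper name parse_checkbox_lines() and the sample text are made up for illustration; check the real output of pdftools::pdf_text() to see which characters your file actually contains.

```r
# Hypothetical helper (not part of tabulapdf or pdftools): read checkbox
# states from the raw text of one page.
# ASSUMPTION: checkboxes appear as these Unicode glyphs in the text layer.
parse_checkbox_lines <- function(page_text,
                                 checked = c("\u2612", "\u2611"),
                                 unchecked = "\u2610") {
  # Alternation pattern matching any of the checkbox glyphs
  box_pattern <- paste(c(checked, unchecked), collapse = "|")

  # Split the page into lines and keep those containing a glyph
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  lines <- lines[grepl(box_pattern, lines)]

  # Extract the glyphs of each line, in reading order
  glyphs <- regmatches(lines, gregexpr(box_pattern, lines))

  # ASSUMPTION: a valid row has exactly two boxes; other lines are skipped
  keep <- lengths(glyphs) == 2
  lines <- lines[keep]
  glyphs <- glyphs[keep]

  data.frame(
    # Row label: what is left once the glyphs are removed
    source = trimws(gsub(box_pattern, "", lines)),
    producteur = vapply(glyphs, function(g) g[1] %in% checked, logical(1)),
    charge_etude = vapply(glyphs, function(g) g[2] %in% checked, logical(1)),
    stringsAsFactors = FALSE
  )
}

# Made-up sample text, NOT taken from test.pdf
sample_page <- paste(
  "Rubrique Source            Producteur   Charg\u00e9 d'\u00e9tude",
  "0010_Balsa_IAA                 \u2612            \u2610",
  "0020_Balsa_EA                  \u2610            \u2610",
  "0040_Geo                       \u2612            \u2612",
  sep = "\n"
)
result <- parse_checkbox_lines(sample_page)
print(result)

# With the real file (requires the pdftools package):
# pages <- pdftools::pdf_text("test.pdf")
# do.call(rbind, lapply(pages, parse_checkbox_lines))
```

If the boxes are drawn as vector shapes or form fields rather than characters, they will not show up in the text layer and this approach will not find them.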
