# Detecting and Cleaning Table Data from PDFs Using Deep Learning and PyTesseract

This project focuses on:

1. Training a deep learning model to detect tabular data in PDFs.
1. Detecting and extracting complex tables from a specific PDF file.
1. Cleaning the extracted data.

The model is trained with the `COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml`
configuration from `Detectron2`. The training dataset was created by me and contains 25
images in which the table columns have been annotated. `PyTesseract` is used to detect
the text.
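
As a rough illustration of how these pieces fit together, the snippet below builds a
Detectron2 predictor from that configuration and the trained weights; the single
"column" class, the score threshold, and the CPU device are assumptions of this sketch
rather than settings taken from the project code.

```python
# Minimal sketch: build a Faster R-CNN predictor from the Detectron2 model zoo config.
# The class count, score threshold, and device below are illustrative assumptions.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
)
cfg.MODEL.WEIGHTS = "model_final-2.pth"      # trained weights (see the download step below)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1          # assumption: a single "column" class
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # assumption: confidence cutoff for detections
cfg.MODEL.DEVICE = "cpu"                     # switch to "cuda" if a GPU is available

predictor = DefaultPredictor(cfg)
```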

> [!NOTE]
> **PyTesseract** can sometimes misread words, so the OCR output may require manual
> verification before serious use.

## How to Run the Project

To run this project, follow these steps:

### 1. Clone this repository

### 2. Create and activate the environment

```console
$ mamba env create -f environment.yml
$ conda activate final_project_btb
```

### 3. Download the data

You have two options for downloading the data:

1. Via Google Drive: click on
   [this link](https://drive.google.com/file/d/1ha7JIu2NRsnpCufi6PjMHNMyqfmNB_z8/view?usp=sharing)
   and then click "Download".
1. Via Dropbox: click on
   [this link](https://www.dropbox.com/s/j3k3kkl97sw9ocy/model_final-2.pth?st=3nt83ul7&dl=0)
   and then click "Download".

### 4. Place the data in the data folder of `src/final_project_btb`

Path: `final-project-s33btorr/src/final_project_btb/data`

### 5. Run `pytask`

```console
$ pytask
```

## Short Explanation of the Project

### Motivation

My motivation for this project stems from the fact that I could not find any pre-trained
model, software, or package that could accurately read the table I needed given its
complexity. Therefore, I trained a model using images similar to those I need to
extract, allowing me to automate the extraction of a large number of pages in the
future. With other programs, this process would take hours and result in a significant
number of errors.

### Overview

In this project, I have trained a deep learning model to detect the columns of a table
from scanned PDFs using the **Roboflow** dataset I generated. After training, the model
can identify the positions of different table columns. The extracted data is then
processed and cleaned for analysis.

### Dataset

You can access the dataset used for training via the following link:
[Roboflow Dataset](https://app.roboflow.com/test-ypjyd/my-first-project-jqmvu/10)

### Training the Model

To train the model, I used the following approach:

1. The model was trained using a GPU provided by **Google Colab**.
1. The model was saved after training as `model_final-2.pth`.
1. The model is capable of detecting the columns in the table from a specific scanned
   PDF.

You can view, download, and modify the code used to train the model in this notebook:
[Training Model Notebook](https://www.dropbox.com/s/dcgerv5i1yp217a/training_model.ipynb?st=5qaeufd1&dl=0)
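
For orientation only, here is a minimal sketch of what such a training setup could look
like with Detectron2's `DefaultTrainer`, assuming the Roboflow dataset is exported in
COCO format; the file paths, batch size, learning rate, and iteration count are
placeholders, and the actual settings are the ones in the notebook above.

```python
# Minimal training sketch. Dataset paths and solver settings are illustrative only.
import os

from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the (assumed) COCO-format export of the Roboflow dataset.
register_coco_instances("columns_train", {}, "train/_annotations.coco.json", "train")

cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
)
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"
)  # start from COCO-pretrained weights
cfg.DATASETS.TRAIN = ("columns_train",)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.SOLVER.IMS_PER_BATCH = 2         # placeholder batch size
cfg.SOLVER.BASE_LR = 0.00025         # placeholder learning rate
cfg.SOLVER.MAX_ITER = 1000           # placeholder number of iterations
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # assumption: a single "column" class

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()  # the trained weights end up in cfg.OUTPUT_DIR as model_final.pth
```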

### Making Predictions

Once the model is trained, it is saved as `model_final-2.pth`. This file is used to:

1. Extract the text using **PyTesseract**. I noticed that PyTesseract leaves a blank
   cell whenever the text wraps to a new line, which can be used to determine the
   boundaries of each row in the table.
1. Predict column positions in new PDF tables with a similar structure. These
   predictions indicate which column each piece of extracted text belongs to, as
   sketched below.
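
Below is a rough sketch of how these two steps can be combined, reusing the `predictor`
built in the configuration sketch earlier; the page image name and the simple
left-to-right ordering of the detected boxes are assumptions of the sketch.

```python
# Sketch: detect column boxes on one scanned page, then OCR each column with PyTesseract.
# "page.png" and the left-to-right sorting are illustrative assumptions.
import cv2
import pytesseract

image = cv2.imread("page.png")
outputs = predictor(image)  # predictor built as in the configuration sketch above
boxes = outputs["instances"].to("cpu").pred_boxes.tensor.numpy()

# Sort the detected columns from left to right and OCR each one separately.
for x0, y0, x1, y1 in sorted(boxes.tolist(), key=lambda box: box[0]):
    column_crop = image[int(y0):int(y1), int(x0):int(x1)]
    text = pytesseract.image_to_string(column_crop)
    # Blank lines in the OCR output tend to appear where a table row ends, which is
    # the row-boundary heuristic described above.
    cells = text.split("\n")
    print(cells)
```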

### Cleaning the Data

After extracting the data, the following cleaning steps were performed (a small sketch
follows the list):

1. **Handling Missing Values**: If the first column is empty, that row is part of the
   previous one. These rows are merged accordingly.
1. **Numeric Fields**: Cleaned numeric fields to ensure they are in a format suitable
   for analysis (e.g., removing non-numeric characters).
1. **CSV and Graph Generation**: Generated the CSV file and some basic graphs to
   visualize the cleaned data.
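
Here is a small `pandas` sketch of the first two steps; the file names and the `amount`
column label are hypothetical placeholders rather than the actual headers of the
extracted table.

```python
# Sketch: merge continuation rows and clean numeric fields before writing the CSV.
# File names and the "amount" column are hypothetical placeholders.
import pandas as pd

df = pd.read_csv("extracted_raw.csv", dtype=str)  # assumed raw output of the extraction

# 1. Handling missing values: an empty first column marks a continuation of the
#    previous row, so its cells are appended to that row and the line is dropped.
first_col = df.columns[0]
rows = []
for _, row in df.iterrows():
    if pd.isna(row[first_col]) and rows:
        for col in df.columns[1:]:
            if pd.notna(row[col]):
                prev = rows[-1][col] if pd.notna(rows[-1][col]) else ""
                rows[-1][col] = f"{prev} {row[col]}".strip()
    else:
        rows.append(row.to_dict())
clean = pd.DataFrame(rows)

# 2. Numeric fields: keep only digits, decimal points, and minus signs, then convert.
clean["amount"] = pd.to_numeric(
    clean["amount"].astype(str).str.replace(r"[^0-9.\-]", "", regex=True),
    errors="coerce",
)

# 3. CSV generation.
clean.to_csv("extracted_clean.csv", index=False)
```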

## References

### Code for training the model

Shen, Z., Zhang, K., & Dell, M. (2020). *A Large Dataset of Historical Japanese
Documents with Complex Layouts*. [arXiv:2004.08686](https://arxiv.org/abs/2004.08686)

### Model used for training

Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., & Girshick, R. (2019). *Detectron2*.
Retrieved from
[https://github.com/facebookresearch/detectron2](https://github.com/facebookresearch/detectron2).