
Commit 7a56ea7

Finish read-me file and add PDF document.
1 parent 4eee6c7 commit 7a56ea7

File tree

3 files changed: +112, -35 lines


.gitignore

Lines changed: 1 addition & 1 deletion
@@ -31,7 +31,7 @@ bld/
 *.fls
 *.log
 *.out
-*.pdf
+#*.pdf
 *.run.xml
 *.synctex.gz
 *.nav

README.md

Lines changed: 111 additions & 34 deletions
@@ -1,49 +1,126 @@
-# Title for final project
+# Detecting and Cleaning Table Data from PDFs Using Deep Learning and PyTesseract
 
-1. Generating or training somehow the program to be able to read PDF tables to Excel.
-1. Replicate a paper that was retracted called: "Priming the Concept of Fullness with
-   Visual Sequences Reduces Portion Size Choice in Online Food Ordering".
+![MIT license](https://img.shields.io/github/license/yourusername/your-repository)
+[![image](https://zenodo.org/badge/12345678.svg)](https://zenodo.org/badge/latestdoi/12345678)
+[![Documentation Status](https://readthedocs.org/projects/your-project-name/badge/?version=stable)](https://your-project-name.readthedocs.io/en/stable/)
+[![image](https://github.com/yourusername/your-repository/actions/workflows/main.yml/badge.svg)](https://github.com/yourusername/your-repository/actions/workflows/main.yml)
+[![image](https://codecov.io/gh/yourusername/your-repository/branch/main/graph/badge.svg)](https://codecov.io/gh/yourusername/your-repository)
 
-For me to make the final decision, I would like to discuss if the first one is feasible
-first.
+This project focuses on:
 
-# Templates for Reproducible Research Projects in Economics
+1. Training a deep learning model to detect tabular data in PDFs.
+1. Detecting and extracting data from a specific PDF file with complex tables.
+1. Cleaning the extracted data.
 
-![MIT license](https://img.shields.io/github/license/OpenSourceEconomics/econ-project-templates)
-[![image](https://zenodo.org/badge/14557543.svg)](https://zenodo.org/badge/latestdoi/14557543)
-[![Documentation Status](https://readthedocs.org/projects/econ-project-templates/badge/?version=stable)](https://econ-project-templates.readthedocs.io/en/stable/)
-[![image](https://github.com/OpenSourceEconomics/econ-project-templates/actions/workflows/main.yml/badge.svg)](https://github.com/OpenSourceEconomics/econ-project-templates/actions/workflows/main.yml)
-[![image](https://codecov.io/gh/OpenSourceEconomics/econ-project-templates/branch/main/graph/badge.svg)](https://codecov.io/gh/OpenSourceEconomics/econ-project-templates)
-[![pre-commit.ci status](https://results.pre-commit.ci/badge/github/OpenSourceEconomics/econ-project-templates/main.svg)](https://results.pre-commit.ci/latest/github/OpenSourceEconomics/econ-project-templates/main)
-
-This project provides a template for economists aimed at facilitating the production of
-reproducible research using the most commonly used programming languages in the field,
-such as Python, R, Julia, and Stata.
+It uses the `"COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"` model from `Detectron2` for
+training. The training dataset was made by me; it contains 25 images in which the
+columns have been marked. `PyTesseract` was used for text detection.
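
The model setup described above can be sketched roughly as follows. This is a hedged configuration sketch based on Detectron2's model zoo API, not the exact code from this repository; the class count, score threshold, and weights path are assumptions:

```python
# Sketch: load the Faster R-CNN config from Detectron2's model zoo and point it
# at this project's trained weights. NUM_CLASSES and the threshold are assumptions.
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultPredictor

cfg = get_cfg()
cfg.merge_from_file(
    model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # assumption: a single "column" class
cfg.MODEL.WEIGHTS = "model_final-2.pth"  # trained weights produced by this project
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5  # assumed confidence threshold
predictor = DefaultPredictor(cfg)  # predictor(image) returns the column boxes
```

Calling `predictor` on a page image then yields the detected column bounding boxes that `PyTesseract`'s text output is matched against.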
 
 > [!NOTE]
-> Although the underlying architecture supports all listed programming languages, the
-> current template implementation is limited to Python and R.
+> **PyTesseract** can sometimes misread words, so the OCR output may require manual
+> verification before serious use.
+
+## How to Run the Project
+
+To run this project, follow these steps:
+
+### 1. Clone this repository
+
+### 2. Create and activate the environment
+
+```console
+$ mamba env create -f environment.yml
+$ conda activate final_project_btb
+```
+
+### 3. Download the data
+
+You have two options to download the data:
+
+1. Via Google Drive: click on
+   [this link](https://drive.google.com/file/d/1ha7JIu2NRsnpCufi6PjMHNMyqfmNB_z8/view?usp=sharing)
+   and then click "Download".
+1. Via Dropbox: click on
+   [this link](https://www.dropbox.com/s/j3k3kkl97sw9ocy/model_final-2.pth?st=3nt83ul7&dl=0)
+   and then click "Download".
+
+### 4. Place the data in the data folder of src/final_project_btb
+
+Path: final-project-s33btorr/src/final_project_btb/data
+
+### 5. Run the pytask command
+
+```console
+$ pytask
+```
+
+## Short explanation of the project
+
+### Motivation
+
+My motivation for this project stems from the fact that I could not find any pre-trained
+model, software, or package that could accurately read the table I needed, given its
+complexity. Therefore, I trained a model on images similar to those I need to extract,
+allowing me to automate the extraction of a large number of pages in the future. With
+other programs, this process would take hours and produce a significant number of
+errors.
+
+### Overview
+
+In this project, I trained a deep learning model to detect the columns of a table in
+scanned PDFs, using the **Roboflow** dataset I generated. After training, the model can
+identify the positions of the different table columns. The extracted data is then
+processed and cleaned for analysis.
+
+### Dataset
+
+You can access the dataset used for training via the following link:
+[Roboflow Dataset](https://app.roboflow.com/test-ypjyd/my-first-project-jqmvu/10)
+
+### Training the Model
+
+To train the model, I used the following approach:
+
+1. The model was trained on a GPU provided by **Google Colab**.
+1. After training, the model was saved as `model_final-2.pth`.
+1. The model can detect the columns of the table in a specific scanned PDF.
+
+You can view, download, and modify the code used to train the model in this notebook:
+[Training Model Notebook](https://www.dropbox.com/s/dcgerv5i1yp217a/training_model.ipynb?st=5qaeufd1&dl=0)
+
+### Making Predictions
+
+Once the model is trained, it is saved as `model_final-2.pth`. This file is used to:
 
-## Getting Started
+1. Extract the text using **PyTesseract**. I noticed that PyTesseract leaves a blank
+   cell whenever the text wraps to a new line; this can be used to determine the
+   boundaries of each row in the table.
+1. Predict column positions in new PDF tables with a similar structure. These
+   predictions indicate which column each piece of text belongs to.
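
The second step above can be sketched as a small helper that maps each OCR word to the predicted column whose horizontal span contains it. `assign_column` and `column_boxes` are hypothetical names, assuming the x-ranges come from the model's predicted bounding boxes; they are not the repository's actual code:

```python
def assign_column(x_center, column_boxes):
    """Return the index of the column whose x-range contains x_center, else None.

    column_boxes is assumed to hold (x_min, x_max) pairs taken from the trained
    model's predicted bounding boxes, sorted left to right.
    """
    for idx, (x_min, x_max) in enumerate(column_boxes):
        if x_min <= x_center <= x_max:
            return idx
    return None
```

For example, with columns spanning x-ranges (0, 100) and (120, 250), `assign_column(50, [(0, 100), (120, 250)])` returns `0`, while a word centered at x = 110 falls outside every column and returns `None`.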
 
-You can find all necessary resources to get started on our
-[documentation](https://econ-project-templates.readthedocs.io/en/stable/).
+### Cleaning the Data
 
-## Contributing
+After extracting the data, the following cleaning steps were performed:
 
-We welcome suggestions on anything from improving the documentation to reporting bugs
-and requesting new features. Please open an
-[issue](https://github.com/OpenSourceEconomics/econ-project-templates/issues) in these
-cases.
+1. **Handling Missing Values**: If the first column is empty, the row is part of the
+   previous one. These rows are merged accordingly.
+1. **Numeric Fields**: Cleaned numeric fields to ensure they are in a format suitable
+   for analysis (e.g., removing non-numeric characters).
+1. **CSV and Graph Generation**: Generated the CSV file and some basic graphs to
+   visualize the cleaned data.
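
Steps 1 and 2 above can be sketched as follows, assuming the extracted table lives in a pandas DataFrame. The function names, column names, and exact merge rule are illustrative, not the repository's actual code:

```python
import pandas as pd

def merge_continuation_rows(df):
    """Merge rows whose first column is empty into the row above (step 1).

    Assumes the DataFrame mirrors the extracted table: an empty first column
    marks a continuation of the previous row.
    """
    rows = []
    for _, row in df.iterrows():
        first = str(row.iloc[0]).strip()
        if rows and first in ("", "nan"):
            # Continuation row: append its non-empty cells to the previous row.
            for col in df.columns[1:]:
                extra = str(row[col]).strip()
                if extra and extra != "nan":
                    rows[-1][col] = f"{rows[-1][col]} {extra}".strip()
        else:
            rows.append(row.to_dict())
    return pd.DataFrame(rows)

def clean_numeric(series):
    """Strip non-numeric characters and coerce to numbers (step 2)."""
    return pd.to_numeric(
        series.astype(str).str.replace(r"[^0-9.\-]", "", regex=True),
        errors="coerce",
    )
```

After these two passes, the DataFrame can be written out with `to_csv` and plotted for the graph-generation step.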
 
-If you want to work on a specific feature, we are more than happy to get you started!
-Please [get in touch briefly](https://www.wiwi.uni-bonn.de/gaudecker), this is a small
-team so there is no need for a detailed formal process.
+## References
 
-### Contributors
+### Code for training the model
 
-@hmgaudecker @timmens @tobiasraabe @mj023
+Shen, Z., Zhang, K., & Dell, M. (2020). "A Large Dataset of Historical Japanese
+Documents with Complex Layouts."
+[arXiv:2004.08686](https://arxiv.org/abs/2004.08686)
 
-### Former Contributors
+### Model used for training
 
-@janosg @PKEuS @philippmuller @julienschat @raholler
+Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., & Girshick, R. (2019). **Detectron2**.
+Retrieved from
+[https://github.com/facebookresearch/detectron2](https://github.com/facebookresearch/detectron2)
PDF document: 2.91 MB (binary file not shown)
