The National Family & Health Survey (NFHS) is a survey in India that collects information on health conditions, nutrition, family planning, domestic violence, and a host of other factors by surveying a random ("representative") sample of Indian households in all states. The fifth NFHS was conducted during 2019-21; the reports were released to the public in 2021 and can be found at this link.
One small problem, however, is that all the reports are provided as PDFs, which are pretty neat for humans to read, but terrible for computers to parse. This repo contains scripts that download all the district-wise reports (704 of them), extract data from the tables, and convert it into machine-friendly JSON. NFHS provides district-wise, state-wise and entire country (aggregate) reports. This repo currently contains code to download, parse and generate JSONs for district-wise reports only. District-wise reports contain information on 104 "indicators" (or questions asked in the survey). State-wise reports seem to contain some extra information that is not reported district-wise, and have 130+ indicators.
Note: I tried my best to make sure the data is parsed correctly, but some data in the JSON might not be 100% accurate. There is no way I could have manually verified all 704 PDF files and their outputs, so I randomly sampled and verified a couple of files, all of which looked okay. If you want to replicate the data parsing from PDFs, feel free to go through the `*.py` files.
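
If you want to repeat that kind of spot check yourself, a minimal sketch could look like the following. It only picks a few JSON files at random so that you can compare them against the matching PDFs by hand. The function name `sample_json_files` is hypothetical and not part of this repo, and the recursive search is an assumption, since the exact layout of the JSON directory is not described here.

```python
import random
from pathlib import Path


def sample_json_files(json_dir, k=5, seed=None):
    """Return up to k randomly chosen .json files found under json_dir.

    Hypothetical helper for manual spot checks; not part of the repo.
    """
    # Search recursively, because the directory layout is an assumption.
    files = sorted(Path(json_dir).rglob("*.json"))
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    chosen = rng.sample(files, min(k, len(files)))
    return sorted(chosen)
```

For example, `sample_json_files("districtwise_data/json", k=5, seed=0)` would return five files to open next to their source PDFs, assuming the JSONs have already been generated in that directory.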
All code in this repository is released under the MIT License. The data (JSON, PDFs) are available as a Kaggle Dataset.
- District-wise data is available at this link (web archive link).
- From this webpage, we get the links to each of the state-wise pages, which are saved in the `statewise_district_links.csv` file.
- Then, the `get_districtwise_links.py` script is used to compile the list of all district-wise file URLs into `districtwise_links.csv`.
- `download_all_districts.py` is used to download the PDFs and save them to `districtwise_data/pdfs`. During this process, it appears that the webpages for one state (Telangana) and one Union Territory (Chandigarh) currently point to a 404 page, so data for these two could not be downloaded this way.
- It looks like district-wise data for Telangana is available in the Telangana State Compendium. We slice this file up district-wise and save the PDFs in the `districtwise_data/telangana` folder. Chandigarh has only one district, which covers the entire union territory, so it probably won't have any separate "district-wise" data as such.
- There are 704 district-wise PDF files, totalling approximately 450MB of data.
- With all this done, we use `parse_pdf.py` to parse the PDFs and dump district-wise data to JSON (in the directory `districtwise_data/json/`). This script uses Tabula and pdfminer.six for parsing PDFs.
- In the first round of PDF parsing, we used the `parse_pdf.py` script at commit `ce4f8ee`. Out of the 704 PDFs, we could generate JSONs successfully for only 563 files. 141 PDFs resulted in errors, which are listed below, along with what was done to solve them:

| State | Failed | Total Files | Failed Filenames | Solving the issue |
|---|---|---|---|---|
| Madhya Pradesh | 50 | 50 | All files | Created a Tabula template file and used that. |
| Rajasthan | 33 | 33 | All files | Created a Tabula template file and used that. |
| Telangana | 31 | 31 | All files | Turns out there was an image on the first (introduction) page, which I forgot to filter out. |
| Himachal Pradesh | 12 | 12 | All files | Created a Tabula template file and used that. |
| nct_of_delhi_ut | 11 | 11 | All files | Created a Tabula template file and used that. |
| Maharashtra | 2 | 36 | raigarh and thane | Raigarh: In general, all district-wise data files had only 6 pages, so I added an assert statement to ensure that the PDF file has exactly six pages. Turns out Raigarh's file has 7 pages (one extra blank page on page 6). Also added a Tabula template file for this. Thane: the error was caused by a nan/empty value in the 'Indicator' column. |
| West Bengal | 1 | 20 | jalpaiguri | This file also has 7 pages instead of 6. The data tables are located on pages [3, 4, 6] (instead of the usual pages [3, 4, 5]); page 5 is blank. |
| Gujarat | 1 | 22 | kheda | On page 4 of this PDF, the heading "NFHS-5 (2019-20)" took two lines instead of one, causing the parsing script to fail. |
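
As a quick arithmetic check, the per-state failure counts in the table should add up to the 141 failed PDFs (704 total minus 563 parsed successfully). The sketch below simply re-enters the numbers from the table by hand; the variable names are illustrative and nothing here is read from the repo itself.

```python
# (failed, total files) per state, copied by hand from the table above.
failures = {
    "Madhya Pradesh": (50, 50),
    "Rajasthan": (33, 33),
    "Telangana": (31, 31),
    "Himachal Pradesh": (12, 12),
    "nct_of_delhi_ut": (11, 11),
    "Maharashtra": (2, 36),
    "West Bengal": (1, 20),
    "Gujarat": (1, 22),
}

TOTAL_PDFS = 704  # district-wise PDFs in total
PARSED_OK = 563   # PDFs parsed successfully in the first round

failed_total = sum(failed for failed, _ in failures.values())

# States where every file failed, which points to a layout problem
# shared by the whole state rather than a one-off quirk in a single file.
all_failed = [s for s, (failed, total) in failures.items() if failed == total]

print(failed_total, TOTAL_PDFS - PARSED_OK)  # both should be 141
print(all_failed)
```

Five states account for 137 of the 141 failures, which is why a per-state Tabula template (or, for Telangana, a filter fix) resolved most of them at once.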
- The Tabula template files were generated manually by dragging and selecting the tables using the Tabula Desktop app for Linux. The saved template files are located in the directory `tabula_templates` in this repository. The app version used was Tabula 1.2.1, which was downloaded from this link (sha256sum: `fea6a5d26e2ab1abf2cc0a694d93810c59e93e0ce9190fce31541fdf6e7e6ece tabula-jar-1.2.1.zip`).
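
If you download the Tabula zip yourself, you may want to confirm that it matches the checksum above before using it. A minimal sketch using only the Python standard library is shown below. The helper names are hypothetical, and the local file path is an assumption about where you saved the download.

```python
import hashlib

# Checksum quoted above for tabula-jar-1.2.1.zip.
EXPECTED_SHA256 = "fea6a5d26e2ab1abf2cc0a694d93810c59e93e0ce9190fce31541fdf6e7e6ece"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read until an empty bytes object signals end of file.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file at path has the expected SHA-256 digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_expected("tabula-jar-1.2.1.zip")` should return `True` for an intact download saved under that name in the current directory.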