Some of the Indian electoral rolls are searchable, with a separate text layer in the right encoding (see here). Most are not. Here, we provide scripts that parse unsearchable rolls from the following states: Bihar, Chandigarh, Delhi (English), Haryana, Himachal Pradesh, Jharkhand, Madhya Pradesh, Rajasthan, Uttar Pradesh, and Uttarakhand.
We provide a separate script for each state because the format of the rolls varies slightly from state to state. Each Python script takes as input the path to the PDF electoral rolls to be parsed and produces a CSV. The columns are generally the following, though the precise set varies by state:
number (top left box in the elector field), id, elector_name, father_or_husband_name,
husband (dummy for husband), house_no, age, sex, ac_name, parl_constituency, part_no,
year, state, filename, main_town, police_station, mandal, revenue_division, district,
pin_code, polling_station_name, polling_station_address, net_electors_male,
net_electors_female, net_electors_third_gender, net_electors_total
We run some basic quality checks on the data, covering data types, missing values, and field sizes. For instance, a data type check confirms that numeric fields such as age contain only numbers, and a field size check looks at the number of characters in a name or a pin_code.
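As a rough illustration of the kinds of checks described above, here is a minimal sketch that flags missing values, non-numeric ages, and fields of unexpected size in a parsed CSV. It is not the code used in this repository: the function names `check_row` and `check_csv` and the `MAX_NAME_LEN` threshold are assumptions made for the example, and the actual rules in the state scripts may differ.

```python
import csv
import io

# Hypothetical thresholds for illustration; the real scripts may use other rules.
MAX_NAME_LEN = 60
PIN_CODE_LEN = 6  # Indian PIN codes have six digits


def check_row(row):
    """Return a list of problems found in one parsed row (a dict)."""
    problems = []

    # Missing values in fields we expect every elector to have.
    for col in ("elector_name", "age", "sex"):
        if not (row.get(col) or "").strip():
            problems.append(f"{col}: missing")

    # Data type: age should consist of digits only.
    age = (row.get("age") or "").strip()
    if age and not age.isdigit():
        problems.append("age: not numeric")

    # Field size: overly long names often signal a parsing error.
    name = (row.get("elector_name") or "").strip()
    if len(name) > MAX_NAME_LEN:
        problems.append("elector_name: too long")

    # Field size and type: pin_code, when present, should be six digits.
    pin = (row.get("pin_code") or "").strip()
    if pin and not (pin.isdigit() and len(pin) == PIN_CODE_LEN):
        problems.append("pin_code: expected 6 digits")

    return problems


def check_csv(handle):
    """Yield (row_number, problems) for each row that fails a check.

    Row numbers start at 2 because the header occupies the first row.
    """
    for i, row in enumerate(csv.DictReader(handle), start=2):
        problems = check_row(row)
        if problems:
            yield i, problems


if __name__ == "__main__":
    sample = (
        "elector_name,age,sex,pin_code\n"
        "Ram Kumar,34,M,110001\n"
        ",3x,F,1100\n"
    )
    for row_number, problems in check_csv(io.StringIO(sample)):
        print(row_number, problems)
```

Only columns that a given state's CSV actually contains are checked for size; a missing pin_code column is simply skipped, since the set of columns varies by state.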
- Assam
- Bihar
- Chandigarh
- Dadra
- Daman
- Delhi
- Haryana
- Himachal Pradesh
- Jharkhand
- Karnataka
- Kerala
- Lakshadweep
- Madhya Pradesh
- Maharashtra
- Odisha
- Punjab
- Rajasthan
- Sikkim
- Tamil Nadu
- Telangana
- Tripura
- Uttar Pradesh
- Uttarakhand
- West Bengal
The final data is posted here.
For transliteration, we tried both polyglot and indic_trans. Both have issues, but indic_trans is better. indicate is better still.
Madhu Sanjeevi and Gaurav Sood