A tool to standardize and geocode Philadelphia addresses
You will need the following things:
- An addresses file, provided to you by CityGeo.
- An AIS API key, provided to you by CityGeo.
- Python installed on your computer, at least version 3.9
To download, use git:
git clone git@github.com:CityOfPhiladelphia/address-geocoder.git
If you have not set up authentication with git on your machine before, reference this guidance on GitHub.
Alternatively, you can download the repository as a zip file using GitHub's web interface.
Next, navigate to the project's directory and create a virtual environment:
python -m venv .venv
Then, activate the virtual environment. This will need to be activated every time you want to run the enrichment tool, not just this once:
source .venv/bin/activate
Finally, install the packages in requirements.txt:
pip install -r requirements.txt
Once you have installed everything, it is time to fill in the config file.
Address Geocoder takes an input file containing addresses and adds latitude and longitude to those addresses, as well as any optional fields that the user supplies.
To run Address Geocoder, first set up the configuration file. By default,
Address Geocoder searches for a file named config.yml, which is the recommended config filename. You can copy the template in
config_example.yml to a file named config.yml and continue from there. Detailed steps for filling out the config file are in the next section.
Then, run:
python3 geocoder.py
The dialogue will ask you to specify a config file. Press Enter without typing anything to keep the default config file ('./config.yml').
- Copy config_example.yml to config.yml by running in the terminal:
cp config_example.yml config.yml
- Add your AIS API Key here:
AIS_API_KEY:
- Add the filepath for the input file (the file that you wish to enrich) and the geography file (the address file you have been given). This should look something like this:
input_file: ./data/example_input_4.csv
geography_file: ./data/addresses.parquet
- Map the address fields to the name of the fields in the csv that you wish to process. If you have one combined address field, map it to full_address_field. Otherwise, leave full_address_field blank and map column names to street, city, state, and zip. Street must be included, while the others are optional.
Example, for a csv with the following fields:
addr_st, addr_city, addr_zip
input_file: 'example.csv'
full_address_field:
address_fields:
  street: addr_st
  city: addr_city
  state:
  zip: addr_zip
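To illustrate how this mapping might be interpreted, here is a minimal Python sketch. It is not the tool's actual code: the function name `build_full_address`, the comma-joined output format, and the sample row are all assumptions made for illustration. It only shows the rule described above: use the combined field if one is mapped, otherwise require street and append whichever optional parts are mapped.

```python
def build_full_address(row, config):
    """Return one address string for a CSV row, based on the config mapping.

    Hypothetical helper: the real tool may combine fields differently.
    """
    full_field = config.get("full_address_field")
    if full_field:
        # A single combined address column was mapped.
        return str(row[full_field]).strip()

    fields = config.get("address_fields") or {}
    if not fields.get("street"):
        raise ValueError("Map full_address_field or address_fields.street")

    parts = []
    for key in ("street", "city", "state", "zip"):
        column = fields.get(key)  # None when the key is left blank
        value = str(row.get(column, "")).strip() if column else ""
        if value:
            parts.append(value)
    return ", ".join(parts)


# Config equivalent to the YAML example above, already parsed into a dict.
config = {
    "full_address_field": None,
    "address_fields": {
        "street": "addr_st",
        "city": "addr_city",
        "state": None,
        "zip": "addr_zip",
    },
}

# Made-up sample row, for illustration only.
row = {"addr_st": "1234 Market St", "addr_city": "Philadelphia", "addr_zip": "19107"}
print(build_full_address(row, config))  # 1234 Market St, Philadelphia, 19107
```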
- List which fields other than latitude and longitude you want to add. (Latitude and longitude will always be added.) If you enter an invalid field, the program will error out and ask you to try again. A complete list of valid fields can be found further down in this README.
enrichment_fields:
- census_tract_2020
- census_block_group_2020
- census_block_2020
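As a rough sketch of the kind of check that rejects invalid field names, the snippet below compares the requested fields against a set of valid ones. The names `VALID_FIELDS` and `find_invalid_fields` are hypothetical, and the set here is only a small subset of the full field list in this README.

```python
# Small illustrative subset of the valid enrichment fields listed in this README.
VALID_FIELDS = {
    "census_tract_2020",
    "census_block_group_2020",
    "census_block_2020",
    "police_district",
    "zip_code",
}


def find_invalid_fields(requested, valid=VALID_FIELDS):
    """Return the requested field names that are not valid, in input order."""
    return [field for field in requested if field not in valid]


requested = ["census_tract_2020", "census_tract_2030"]
invalid = find_invalid_fields(requested)
if invalid:
    print(f"Invalid enrichment fields: {', '.join(invalid)}")
```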
The full config file should look something like this:
# Connection Credentials
AIS_API_KEY: YOUR_API_KEY
# File Config
input_file: ./data/example_input_4.csv
geography_file: ./data/addresses.parquet
full_address_field: address
# OR, IF ADDRESS IS SPLIT INTO MULTIPLE COLUMNS:
address_fields:
  street:
  city:
  state:
  zip:
# Enrichment Fields -- Aside from coordinates, what fields to add
enrichment_fields:
- census_tract_2020
- census_block_group_2020
- census_block_2020
Note that if one of the input fields has the same name as an enrichment field, the input field will be renamed to have the _left suffix.
- You're now ready to run the geocoder:
python3 geocoder.py
The dialogue will ask you to specify a config file. Press Enter without typing anything to keep the default config file ('./config.yml').
The output file will be saved in the same location as your input file, with _enriched attached to the filename.
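If you want to locate the output file from a script, the naming rule can be sketched with `pathlib`. The helper name `enriched_path` is hypothetical, and the sketch assumes the output keeps the input file's extension, which this README does not state explicitly.

```python
from pathlib import Path


def enriched_path(input_file):
    """Return the input path with _enriched added before the extension."""
    path = Path(input_file)
    # Same directory as the input, filename stem plus "_enriched".
    return path.with_name(f"{path.stem}_enriched{path.suffix}")


print(enriched_path("./data/example_input_4.csv"))
```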
Address-Geocoder processes a csv file with addresses and geolocates those
addresses using the following steps:
- Takes an input file of addresses and standardizes those addresses using passyunk, Philadelphia's address standardization system.
- Compares the standardized data to a local parquet file, addresses.parquet, and adds the user-specified fields as well as latitude and longitude from that file.
- Not all records will match to the address file. For those records that do not match, Address-Geocoder queries the Address Information System (AIS) API and adds returned fields. Please note that this process can take some time, so processing large files with a messy address field is not recommended. As an example, if you have a file that needs 1,000 rows to be sent to AIS, this will take approximately 3-4 minutes.
- The enriched file is then saved to the same directory as the input file.
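To get a rough sense of how long the AIS step might take for a larger file, the 1,000-row figure above can be scaled up. This sketch assumes the time grows linearly with the number of rows sent to AIS, which is an assumption rather than a measured property of the API; the function name `estimate_ais_seconds` is hypothetical.

```python
def estimate_ais_seconds(rows_sent_to_ais):
    """Return a (low, high) estimate in seconds.

    Scales the README's figure of 3-4 minutes per 1,000 rows linearly.
    """
    low = rows_sent_to_ais * 180 / 1000   # 3 minutes per 1,000 rows
    high = rows_sent_to_ais * 240 / 1000  # 4 minutes per 1,000 rows
    return low, high


low, high = estimate_ais_seconds(25_000)
print(f"Roughly {low / 60:.0f}-{high / 60:.0f} minutes")  # Roughly 75-100 minutes
```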
This package uses the pytest module to conduct unit tests. Tests are
located in the tests/ folder.
To run all tests:
python3 -m pytest tests/
To run tests from one file:
python3 -m pytest tests/test_parser.py
To run one test within a file:
python3 -m pytest tests/test_parser.py::test_parse_address
Note that if any of the fields in the input file have the same name as an enrichment field, the incoming input file field will be renamed to have the _left suffix.
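The renaming rule can be illustrated with a short sketch that works on column names only. The function name `rename_conflicting_columns` is hypothetical, and the tool itself may perform the rename differently (for example, as part of a join).

```python
def rename_conflicting_columns(input_columns, enrichment_fields, suffix="_left"):
    """Add the suffix to any input column that shares a name with an enrichment field."""
    enrichment = set(enrichment_fields)
    return [
        f"{column}{suffix}" if column in enrichment else column
        for column in input_columns
    ]


columns = rename_conflicting_columns(
    ["address", "zip_code"],
    ["zip_code", "census_tract_2020"],
)
print(columns)  # ['address', 'zip_code_left']
```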
Field |
|---|
address_high |
address_low_frac |
address_low_suffix |
address_low |
bin |
census_block_2010 |
census_block_2020 |
census_block_group_2010 |
census_block_group_2020 |
census_tract_2010 |
census_tract_2020 |
center_city_district |
clean_philly_block_captain |
commercial_corridor |
council_district_2016 |
council_district_2024 |
cua_zone |
dor_parcel_id |
eclipse_location_id |
elementary_school |
engine_local |
high_school |
highway_district |
highway_section |
highway_subsection |
historic_district |
historic_site |
historic_street |
ladder_local |
lane_closure |
leaf_collection_area |
li_address_key |
li_district |
major_phila_watershed |
middle_school |
neighborhood_advisory_committee |
opa_account_num |
opa_address |
opa_owners |
philly_rising_area |
planning_district |
police_district |
police_division |
police_service_area |
political_division |
political_ward |
ppr_friends |
pwd_account_nums |
pwd_center_city_district |
pwd_maint_district |
pwd_parcel_id |
pwd_pressure_district |
pwd_treatment_plant |
pwd_water_plate |
recycling_diversion_rate |
rubbish_recycle_day |
sanitation_area |
sanitation_convenience_center |
sanitation_district |
seg_id |
state_house_rep_2012 |
state_house_rep_2022 |
state_senate_2012 |
state_senate_2022 |
street_code |
street_light_route |
street_name |
street_postdir |
street_predir |
street_suffix |
traffic_district |
traffic_pm_district |
unit_num |
unit_type |
us_congressional_2012 |
us_congressional_2018 |
us_congressional_2022 |
zip_4 |
zip_code |
zoning_document_ids |
zoning_rco |
zoning |
flowchart TB
A["Input Address"] --> B@{ label: "Is it a Philadelphia address? If unknown, assume it's Philadelphia." }
B -- Yes --> C["Match to address file"]
B -- No --> D["Match to TomTom"]
C -- Match --> E["Return geocoded address with enrichment fields"]
C -- No Match --> F["Is the address an intersection?"]
D -- Match --> G["Return geocoded address, but no enrichment fields"]
D -- No Match --> H["Return non-match"]
F -- Yes --> I["Get intersection latitude and longitude from AIS"]
F -- No --> J["Run AIS address match"]
I --> K["Get address through AIS reverse lookup"]
J -- Match --> E
J -- No Match --> D
K --> J
A@{ shape: manual-input}
B@{ shape: decision}
C@{ shape: process}
D@{ shape: process}
E@{ shape: terminal}
F@{ shape: decision}
G@{ shape: terminal}
H@{ shape: terminal}
I@{ shape: process}
J@{ shape: process}
K@{ shape: process}
style B fill:#BBDEFB
style C fill:#FFE0B2
style D fill:#FFE0B2
style E fill:#C8E6C9
style F fill:#BBDEFB
style G fill:#FFF9C4
style H fill:#FFCDD2
style I fill:#FFE0B2
style J fill:#FFE0B2
style K fill:#FFE0B2
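The decision flow in the chart above can also be read as a single function. The sketch below is only an illustration of the chart's branching, not the tool's implementation: every matcher is a caller-supplied stand-in (hypothetical names) that returns a result or `None`, and the status strings are invented labels for the three terminal boxes.

```python
def route_address(address, is_philadelphia, is_intersection,
                  match_file, match_ais, ais_intersection, ais_reverse,
                  match_tomtom):
    """Follow the flowchart for one address and return (status, result)."""
    if is_philadelphia(address):
        # First try the local address file.
        result = match_file(address)
        if result is not None:
            return "enriched", result

        # No file match: intersections go through AIS coordinates and a
        # reverse lookup before the regular AIS address match.
        ais_input = address
        if is_intersection(address):
            coordinates = ais_intersection(address)
            ais_input = ais_reverse(coordinates)

        result = match_ais(ais_input)
        if result is not None:
            return "enriched", result

    # Non-Philadelphia addresses and AIS non-matches fall through to TomTom.
    result = match_tomtom(address)
    if result is not None:
        return "coordinates_only", result
    return "non_match", None


# Stand-in matchers that never match, so the address ends as a non-match.
status, result = route_address(
    "1234 MARKET ST",
    is_philadelphia=lambda a: True,
    is_intersection=lambda a: " & " in a,
    match_file=lambda a: None,
    match_ais=lambda a: None,
    ais_intersection=lambda a: (0.0, 0.0),  # placeholder coordinates
    ais_reverse=lambda c: "1234 MARKET ST",
    match_tomtom=lambda a: None,
)
print(status)  # non_match
```

Passing the matchers in as arguments keeps the sketch runnable without network access or an API key; in practice each stand-in would wrap the corresponding lookup.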