Scrapper prepares `organisation.data.xml.csv` from publishers' organisation XML files and `publishers.data.scrapping.csv` from publisher information in the IATI Registry.
For each organisation record, the script checks (see `OrganisationCollection>checkAndUpdate`):
- whether the organisation-list part of the identifier is valid, based on org-id.guide
- whether the organisation identifier is present in the IATI organisation codelist
- if the identifier already exists, the metadata is updated when it has changed
- if the name already exists, that organisation is ignored and the initially saved identifier is kept
- otherwise, the data is added to the CSV list for importing into the database
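The checks above can be sketched roughly as follows. This is a hypothetical illustration, not the actual `checkAndUpdate` implementation: the function name, the in-memory dict, and the prefix-splitting rule are all assumptions.

```python
def check_and_update(organisations, identifier, name, metadata, valid_prefixes):
    """Hypothetical sketch of the checkAndUpdate flow described above."""
    # 1. Validate the organisation-list part of the identifier
    #    (assumed here to be everything before the last "-") against org-id.guide.
    prefix = identifier.rsplit("-", 1)[0]
    if prefix not in valid_prefixes:
        return "invalid-prefix"
    # 2. If the identifier already exists, update the metadata when it changed.
    if identifier in organisations:
        if organisations[identifier]["metadata"] != metadata:
            organisations[identifier]["metadata"] = metadata
            return "updated"
        return "unchanged"
    # 3. If the name already exists, ignore this organisation and keep the
    #    identifier that was saved first.
    if any(org["name"] == name for org in organisations.values()):
        return "duplicate-name"
    # 4. Otherwise add the record for export to the CSV list.
    organisations[identifier] = {"name": name, "metadata": metadata}
    return "added"
```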
- Sources are in `src/cleanup`
- Run `python initial_cleanup.py` to clean up the organisation data
It reads `data/organisation.data.xml.csv` and `data/publishers.data.scrapping.csv` and generates `out/organisations-clean.csv`, which contains the valid organisation information.
`organisations-clean.csv` is cleaned up manually if needed.
- Sources are in `src/dump`
- Copy `config.py.bak` to `config.py`
- Create a Postgres database and update `config.py` with its credentials
- Run `python dump.py`, which reads `organisations-clean.csv` and dumps the data into the database you have just created
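The dump step can be sketched as turning each CSV row into a parameterised `INSERT`. The table name `organisations` and its columns are assumptions; in `dump.py` these statements would be executed through a Postgres driver such as psycopg2, but the sketch below only builds them, so it stays self-contained.

```python
import csv
import io

# Assumed table and columns -- not necessarily the real schema.
INSERT_SQL = "INSERT INTO organisations (identifier, name) VALUES (%s, %s)"

def insert_statements(csv_text):
    """Yield (sql, params) pairs for each row of organisations-clean.csv.

    A real dump script would pass each pair to cursor.execute(sql, params)
    inside a transaction against the database configured in config.py.
    """
    for row in csv.DictReader(io.StringIO(csv_text)):
        yield INSERT_SQL, (row["identifier"], row["name"])
```

Using parameterised statements (rather than string formatting) lets the driver handle quoting and escaping of the organisation names.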