Data processing overview

How to process Darwin Core (DwC) data from CSV to a SOLR index using GBIF/ALA Pipelines

Each data "source" requires an entry in the ALA Collections application (metadata store - aka Collectory). When a new data resource is minted, it gets a unique ID in the form of dr12345 (see example for BPA Portal). This ID is then used throughout the processing as the "key" for this dataset.

Therefore, any new dataset must first have a metadata ID assigned by creating a new entry in collections.ala.org.au (or collections-test.ala.org.au for testing and non-production loading). Matt can give you access to create new entries via https://collections-test.ala.org.au/admin. Use an existing ARGA dataset as a template.

DwC to DwCA conversion

To convert a plain CSV DwC file to a Darwin Core Archive file (DwCA), the preingestion Python module is used (note the use of the arga branch). The output is a ZIP file containing 3 or more files:

  • eml.xml - metadata about the dataset, info is pulled in from the collectory
  • meta.xml - metadata about the DwC data, has URIs for field headings in the CSV file and contains relational information when multiple CSV files are provided
  • occurrence.csv - DwC CSV file (can also be a TSV if specified in meta.xml)

Example Python script to perform this conversion, where the input CSV file has been saved in /data/arga-data/bpa_export.csv.

from dwca import CsvFileType, DwcaHandler

# data resource UID minted in the Collectory for this dataset
data_resource = "dr18544"

# the core occurrence CSV to be packaged into the archive
core_file = CsvFileType(files=['/data/arga-data/bpa_export.csv'], type='occurrence')

# build the DwCA zip in the standard biocache-load location for this data resource
DwcaHandler.create_dwca(dr=data_resource, core_csv=core_file,
                        output_dwca_path=f"/data/biocache-load/{data_resource}/{data_resource}.zip", use_chunking=False)

# Copy over to server via:
# rsync -avz --progress /data/biocache-load/dr18544 [email protected]:"/data/biocache-load/dr18544/"
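 
Before copying the archive to the server, it is worth a quick sanity check that the zip contains the expected files (a minimal check, using the paths from the example above):

unzip -l /data/biocache-load/dr18544/dr18544.zip
# expect to see eml.xml, meta.xml and the occurrence CSV in the listing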

Pipelines processing

ARGA uses the livingatlas set of pipeline processes, found at https://github.com/gbif/pipelines/tree/dev/livingatlas. The README file has all the information needed to run the processing on a local development machine.

The intermediate outputs of the processing are AVRO files, which can be viewed (as JSON) using the avro-tools utility (see README). The final solr step adds records to the SOLR index, which can be viewed via its web interface.

ARGA-specific config - save to configs/la-pipelines-local.yaml:

index:
  includeSensitiveDataChecks: false
  includeImages: false
collectory:
  httpHeaders:
    Authorization: xxxxx-xxxx-xxxx-xxxx
alaNameMatch:
  wsUrl: https://namematching-ws.ala.org.au
  timeoutSec: 100
  retryConfig:
    maxAttempts: 5
    initialIntervalMillis: 5000

Note that the Authorization key needs to be obtained separately (ask Nick via Slack).
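 
To sanity-check that the name matching service configured under wsUrl is reachable before running the pipelines, a simple request can be made (the /api/search endpoint used here is an assumption based on the public ALA namematching-ws; adjust if the deployment differs):

curl "https://namematching-ws.ala.org.au/api/search?q=Macropus+rufus"
# a JSON response containing a match indicates the service is reachable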

Code snippet to run the ARGA pipelines

cd scripts
# set a shell variable for the dataset ID to process
# for multiple datasets, space separate them, e.g. DR="dr18509 dr18541 dr18544"
DR=dr18509
./la-pipelines dwca-avro $DR && ./la-pipelines interpret $DR --embedded && ./la-pipelines uuid $DR --embedded && ./la-pipelines index $DR --embedded && ./la-pipelines sample $DR --embedded  && ./la-pipelines solr $DR --embedded

Checking AVRO files

 avro-tools tojson ./occurrence/location/interpret-1663602980-00000-of-00008.avro | jq '. | select(.country=="Australia")' | more
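 
To get a rough record count (tojson emits one JSON object per line, so counting lines counts records; the file path is just the example above):

 avro-tools tojson ./occurrence/location/interpret-1663602980-00000-of-00008.avro | wc -l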

Some example SOLR queries
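 
The queries below are a sketch only: the core name biocache and the dataResourceUid/country fields are assumptions based on a standard ALA biocache SOLR schema, and the host/port assumes the Docker SOLR on the server - adjust to match the actual index.

# total number of indexed records
curl "http://localhost:8983/solr/biocache/select?q=*:*&rows=0"
# records belonging to a single dataset
curl "http://localhost:8983/solr/biocache/select?q=dataResourceUid:dr18509&rows=5"
# facet counts by country
curl "http://localhost:8983/solr/biocache/select?q=*:*&rows=0&facet=true&facet.field=country"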

Processing data on Nectar server

The test server https://nectar-arga-dev-4.ala.org.au/ is where SOLR is running and is also where new data is ingested via Pipelines. The process is roughly:

  1. Extract DwC CSV files on a local computer and convert to DwCA via the preingestion script
  2. Copy the resulting file to the server via the command:
rsync -avz --progress /data/biocache-load/dr18544 [email protected]:"/data/biocache-load/dr18544/"

where username is your login name for the server

  3. Log in to the server via ssh and cd to the pipelines directory:
ssh [email protected]
cd /data/pipelines/livingatlas/
  4. Follow the instructions at https://github.com/gbif/pipelines/tree/master/livingatlas to set up and ingest a dataset. I usually do the following:
cd scripts
DR=dr18509 # set a shell variable for the dataset ID; multiple datasets are space separated - DR="dr18509 dr18541 dr18544"
./la-pipelines dwca-avro $DR && ./la-pipelines interpret $DR --embedded && ./la-pipelines uuid $DR --embedded && ./la-pipelines index $DR --embedded && ./la-pipelines sample $DR --embedded && ./la-pipelines solr $DR --embedded

This can take a while to run, so sometimes I run it inside a screen session so that if my Mac goes to sleep it doesn't interrupt the process (see the screen sketch after this list). If any of the commands error, you'll need to debug the output to see why. Assuming it's all worked as expected, you should see output like this at the end:

05-Apr 15:50:34 [LA-PIPELINES] [dr18541] [INFO] END SOLR of dr18541 in [spark-embedded], took 12 minutes and 2 seconds.
  5. Check the SOLR index or the React demo app to confirm the data was indexed (exercise for the reader).
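 
A minimal screen workflow for the long-running steps (the session name is arbitrary):

screen -S pipelines   # start a named session on the server
# run the la-pipelines chain as above, then detach with Ctrl-a d
screen -r pipelines   # reattach later to check progress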

Gotchas

  • Pipelines adds extra data fields to records at the taxon level, via species lists (marked as authoritative). These include conservation status, traits and taxonLocatedInCountry fields. They are currently stored on the test lists server, but some lists will need to be added to the production server at some point (export the CSV and import it as a new list entry).
  • The taxonLocatedInCountry list is https://lists-test.ala.org.au/speciesListItem/list/dr18685 and is a very large list. It may have its authoritative flag turned off (set to false) at times, because it slows the indexing of ALA data. So if this field is not getting populated, check that the list has Authoritative: Yes under the Info section (which is hidden by default - click the Info button to show it). Someone with ROLE_ADMIN can edit that list via the "Edit" button.
  • Pipelines has dependencies on quite a few services and some are run on the same machine via Docker (see README). Docker can sometimes fill up the root partition on the server; this can be fixed by clearing out old and unused images (via dedicated Docker commands - see the sketch below). SOLR also runs in Docker and can sometimes crash and not come back up - this is usually fixed by running the docker-compose up command again, but occasionally it needs to be recreated from scratch and all data is lost. In that case all datasets need to be reindexed, but only the SOLR pipeline step needs to be run, via ./la-pipelines solr $DR --embedded.
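 
A sketch of the clean-up and restart commands (the compose file location and service names on the server are assumptions - check the livingatlas README for the exact setup):

# reclaim space from unused images, containers and build cache
docker system prune -a
# restart the Docker services (including SOLR) in the background
docker-compose up -d
# if the SOLR index was recreated from scratch, re-run only the solr step per dataset
./la-pipelines solr $DR --embedded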