Description
This is probably a pipelines process that will compare the taxon for a given record with a list of all known taxa for Australia.
The known list of Australian taxa should be derived from the ALA Biocache data, using a filter for country:Australia (uses AUS EEC layer).
CSV download:
https://biocache.ala.org.au/occurrences/facets/download?q=*%3A*&qualityProfile=ALA&facets=taxon_name
https://biocache-ws.ala.org.au/ws/occurrences/facets/download?q=*%3A*&qualityProfile=ALA&facets=taxon_name
https://biocache-ws.ala.org.au/ws/occurrences/facets/download?q=country:Australia&qualityProfile=AVH&facets=taxonConceptID&count=true&file=AU_all_taxa_tc_counts.csv
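A minimal sketch of turning such a facet download into a lookup of known taxa. It assumes the CSV has a header row, the facet value in the first column and, when `count=true` is requested, the record count in the second. The actual layout of the file should be checked. The function name `parse_facet_csv` and the sample rows are made up for illustration.

```python
import csv
import io


def parse_facet_csv(text):
    """Map each facet value (first column) to its record count (second column)."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row (assumed to be present)
    counts = {}
    for row in reader:
        if not row or not row[0].strip():
            continue  # ignore blank lines and empty facet values
        value = row[0].strip()
        # The count column is assumed to exist only when count=true was requested.
        counts[value] = int(row[1]) if len(row) > 1 and row[1].strip() else 0
    return counts


# Hypothetical sample data, not real ALA output.
sample = (
    '"taxonConceptID","count"\n'
    '"https://example.org/taxon/1","120"\n'
    '"https://example.org/taxon/2","3"\n'
)
known_taxa = parse_facet_csv(sample)
```

The keys of `known_taxa` can then serve as the set of taxa known for Australia.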
Generating a list of taxa for that query using SOLR or biocache-service is difficult: the result set is huge and the API times out.
One option is to use SOLR with deep pagination using cursors.
Another is to run the query on Pipelines via Spark and save the result in S3. This seems to be the safest and most reliable option.
Use the CSV download (above) to get data into Pipelines. The existing species-list pipeline would be a good starting point in the code. This pipeline accesses the ALA list API to pull down KV data and populate avro files using the taxon as a primary key.
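For reference, the SOLR deep pagination option could look roughly like the sketch below. It uses SOLR's `cursorMark` / `nextCursorMark` mechanism, which requires a sort that includes the unique key field. The field names `taxonConceptID` and `id`, the function names, and the page size are assumptions, not taken from the biocache schema. Cursors page over documents rather than facets, so this walks every matching record and collects the distinct taxa.

```python
import json
import urllib.parse
import urllib.request


def fetch_solr_page(base_url, params):
    """Fetch one page of SOLR results as parsed JSON."""
    url = base_url + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def collect_taxa(fetch_page, rows=1000):
    """Walk the whole result set with a cursor and return the distinct taxon IDs."""
    taxa = set()
    cursor = "*"  # "*" starts a new cursor
    while True:
        page = fetch_page({
            "q": "country:Australia",
            "fl": "taxonConceptID",  # assumed field name
            "sort": "id asc",        # assumes "id" is the unique key
            "rows": rows,
            "cursorMark": cursor,
            "wt": "json",
        })
        for doc in page["response"]["docs"]:
            if doc.get("taxonConceptID"):
                taxa.add(doc["taxonConceptID"])
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:
            break  # an unchanged cursor means the results are exhausted
        cursor = next_cursor
    return taxa
```

`fetch_page` is passed in so the loop can be tried against canned responses. Against a real server it would be something like `lambda p: fetch_solr_page(solr_select_url, p)`, where `solr_select_url` is a placeholder for the collection's select endpoint.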
The result needs a field name, something like presentInCountry:Australia. There might be an existing term for this, so this needs some research.
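As an illustration of the comparison step, the sketch below tags a record depending on whether its taxon appears in the known list. It is plain Python rather than Pipelines code. The `presentInCountry` key follows the placeholder name suggested above, while the `taxonConceptID` key and the function name `tag_presence` are assumptions.

```python
def tag_presence(record, known_taxa, country="Australia"):
    """Return a copy of the record with a presentInCountry value added."""
    tagged = dict(record)  # leave the input record unmodified
    taxon_id = record.get("taxonConceptID")
    # Set the country when the taxon is in the known list, otherwise None.
    tagged["presentInCountry"] = country if taxon_id in known_taxa else None
    return tagged


# Hypothetical identifiers, for illustration only.
known = {"taxon-1", "taxon-2"}
present = tag_presence({"id": "r1", "taxonConceptID": "taxon-1"}, known)
absent = tag_presence({"id": "r2", "taxonConceptID": "taxon-9"}, known)
```

Here `present["presentInCountry"]` is `"Australia"` and `absent["presentInCountry"]` is `None`.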