
Add data field for "taxa is present in Australia" #31

@nickdos

Description


This is probably a pipelines process that will compare the taxon for a given record with a list of all known taxa for Australia.

The known list of Australian taxa should be derived from the ALA Biocache data, using a filter for country:Australia (which uses the AUS EEC layer).

CSV download:

https://biocache.ala.org.au/occurrences/facets/download?q=*%3A*&qualityProfile=ALA&facets=taxon_name
https://biocache-ws.ala.org.au/ws/occurrences/facets/download?q=*%3A*&qualityProfile=ALA&facets=taxon_name
https://biocache-ws.ala.org.au/ws/occurrences/facets/download?q=country:Australia&qualityProfile=AVH&facets=taxonConceptID&count=true&file=AU_all_taxa_tc_counts.csv

https://biocache-ws.ala.org.au/ws/occurrences/facets/download?q=country:Australia&fq=taxonRankID:[6000 TO 7000]&qualityProfile=AVH&facets=scientificName,taxonConceptID&count=true&file=AU_all_taxa_counts
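The last URL contains spaces and square brackets in the `fq` range, which some HTTP clients will not encode for you. A minimal sketch of building the same request with the Python standard library follows. The parameter names are copied from the URLs above; the helper name `facet_download_url` is hypothetical, and it is an assumption that the service accepts the percent-encoded form of each value.

```python
from urllib.parse import urlencode

# Endpoint taken from the URLs listed in this issue.
BASE = "https://biocache-ws.ala.org.au/ws/occurrences/facets/download"


def facet_download_url(q, facets, quality_profile="AVH", fq=None, count=True, file=None):
    """Build a facet download URL (hypothetical helper, not part of any ALA client)."""
    params = [("q", q)]
    if fq:
        params.append(("fq", fq))
    params.append(("qualityProfile", quality_profile))
    # Multiple facets are passed as one comma-separated value, as in the URL above.
    params.append(("facets", ",".join(facets)))
    if count:
        params.append(("count", "true"))
    if file:
        params.append(("file", file))
    # urlencode percent-encodes brackets, colons and commas, and turns spaces into '+'.
    return BASE + "?" + urlencode(params)


url = facet_download_url(
    q="country:Australia",
    fq="taxonRankID:[6000 TO 7000]",
    facets=["scientificName", "taxonConceptID"],
    file="AU_all_taxa_counts",
)
print(url)
```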

Generating the list of taxa for that query through SOLR or biocache-service is difficult: the result set is huge, and the API times out before it completes.

One option is to use SOLR with deep pagination using cursors.
Another is to run the query on Pipelines via Spark and save the result in S3. This seems to be the safest and most reliable option.
Use the CSV download (above) to get data into Pipelines. The existing species-list pipeline would be a good starting point in the code. This pipeline accesses the ALA list API to pull down key-value (KV) data and populate Avro files using the taxon as a primary key.
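For the first option, the cursor loop can be sketched as below. Solr's cursor paging starts with `cursorMark=*`, requires a sort that includes the unique key field, and is finished when the returned `nextCursorMark` equals the one that was sent. The unique key name `id` and the page size are assumptions, and `fetch_page` is a hypothetical callable standing in for the HTTP request to the Solr select handler. The `fake_fetch` function is only an in-memory stand-in so the loop can be run without a server; real cursor marks are opaque strings, not offsets.

```python
from typing import Callable, Dict, Iterator


def iter_solr_docs(fetch_page: Callable[[Dict[str, str]], dict],
                   query: str,
                   unique_key: str = "id",   # assumption: the collection's uniqueKey field
                   rows: int = 1000) -> Iterator[dict]:
    """Yield every document matching `query`, one cursor page at a time."""
    cursor = "*"  # Solr's start-of-results cursor
    while True:
        params = {
            "q": query,
            "rows": str(rows),
            "sort": f"{unique_key} asc",  # cursors need a sort on the unique key
            "cursorMark": cursor,
            "wt": "json",
        }
        page = fetch_page(params)
        yield from page["response"]["docs"]
        next_cursor = page["nextCursorMark"]
        if next_cursor == cursor:  # unchanged cursor means no more results
            break
        cursor = next_cursor


def fake_fetch(params: Dict[str, str]) -> dict:
    """In-memory stand-in for Solr; uses offsets as cursor marks for illustration only."""
    docs = [{"id": str(i)} for i in range(5)]
    start = 0 if params["cursorMark"] == "*" else int(params["cursorMark"])
    chunk = docs[start:start + int(params["rows"])]
    next_cursor = str(start + len(chunk)) if chunk else params["cursorMark"]
    return {"response": {"docs": chunk}, "nextCursorMark": next_cursor}


ids = [doc["id"] for doc in iter_solr_docs(fake_fetch, "country:Australia", rows=2)]
print(ids)
```

Note that this walks occurrence documents rather than facet values, so a distinct taxon list would still need to be collected into a set on the client side.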

This data needs a field name, something like presentInCountry:Australia. There might be an existing term for this, so it needs some research.
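As a rough illustration of the comparison step, the sketch below loads taxon identifiers from a CSV and tags each record with the proposed field. Everything here is an assumption rather than the pipeline's actual code: the CSV header `taxonConceptID`, the record key of the same name, the helper names, and the sample identifiers are all made up for the example, and `presentInCountry` is only the placeholder name suggested above.

```python
import csv
import io


def load_taxon_ids(csv_file, column="taxonConceptID"):
    """Read one column of a CSV into a set of non-empty identifiers."""
    ids = set()
    for row in csv.DictReader(csv_file):
        value = (row.get(column) or "").strip()
        if value:
            ids.add(value)
    return ids


def annotate(record, australian_taxa, field="presentInCountry", id_field="taxonConceptID"):
    """Return a copy of the record with the presence field set to 'Australia' or None."""
    out = dict(record)
    out[field] = "Australia" if record.get(id_field) in australian_taxa else None
    return out


# Made-up sample data standing in for the facet CSV download.
sample_csv = io.StringIO(
    "taxonConceptID,count\n"
    "example-taxon-1,120\n"
    "example-taxon-2,7\n"
)
australian_taxa = load_taxon_ids(sample_csv)

known = annotate({"taxonConceptID": "example-taxon-1"}, australian_taxa)
unknown = annotate({"taxonConceptID": "example-taxon-9"}, australian_taxa)
print(known["presentInCountry"], unknown["presentInCountry"])
```

Keeping the taxon list in a set makes each record lookup a constant-time check, which matters if the comparison runs over every record in a dataset.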
