
Commit 7b03c60

Merge pull request #234 from ARGA-Genomes/develop
Develop Merge
2 parents d40eb25 + b9373d5 commit 7b03c60

111 files changed, +2417 -2320 lines changed


.gitignore

Lines changed: 2 additions & 0 deletions
@@ -35,6 +35,7 @@ dataSources/**/data
 dataSources/**/examples
 dataSources/**/crawlerProgress
 dataSources/**/*.json
+dataSources/**/*.toml
 !dataSources/**/config.json
 
 # Data Files #
@@ -52,6 +53,7 @@ dataSources/**/*.json
 .env
 *.DS_Store
 *.code-workspace
+secrets.toml
 
 # Cached Files #
 *__pycache__

README.md

Lines changed: 25 additions & 22 deletions
@@ -2,30 +2,33 @@
 ARGA will allow researchers to easily discover and access genomic data for Australian species in one place, facilitating research and informing decision making in areas including conservation, biosecurity and agriculture.
 
 ## ARGA Data
-This repo is for Python and related code for data ingestion and pre-ingestion munging, prior to loading DwCA data into the Pipelines workflow.
+This repo is for Python and related code for data ingestion and pre-ingestion munging, prior to loading data into arga-backend.
 
 ## Setup
-Set up can be initiated by being in the base directory and running `deploy.py`, which should create a virtual environment and add a link to the src folder required to run the scripts. Further are provided as part of that script after it completes.
-
-## Processing Data
-The configuration files in `dataSources/location/database` are how processing understands the procedure. To add a new source, you'll need to create a new sourceConfig file. To access that source, you can call one of the source processing tools with the syntax `location-database`. For example, to process the genbankSummary data source which is part of the ncbi, your source is called `ncbi-genbankSummary`, which is case sensitive.
-
-The tools currently available are:
-- listSources.py
-- newSource.py
-- purgeSource.py
-- download.py
-- process.py
-- convert.py
-- getFields.py
-
-`listSources` shows you a list of currently available sources.
-`newSource` creates a new source folder and basic config to kickstart adding a new source.
-`purgeSource` deletes a source that is no longer required.
-`download` is for simply downloading the data, which will vary based on your database type.
-`processing` is for running initial processing on files outlined in the source config.
-`convert` will then convert the old file to the new file mappings, as well as enrich and augment if required.
-`getFields` will read the pre-conversion file and give examples of field names and how they'll be mapped with the appropriate mapping file.
+Setup is initiated by running `deploy.py` from the base directory, which should create a virtual environment and add a link to the src folder required to run the scripts. Further help is provided as part of that script after it completes. This process is platform independent.
+
+## Data Sources
+This pipeline works with config files in the `dataSources` folder of the main directory. At the top level, folders exist for each of the currently available data locations, such as `ncbi`. Within each location folder are database folders, which further divide the location into separate databases. For example, the `ncbi` folder mentioned previously has a subfolder named `nucleotide`, which is the nucleotide data available from the ncbi data location. At this level a config file outlines how the dataset should be run to produce a mapped output file ready for ingestion; a scripts folder stores database-specific scripts, and a data folder will be created during the pipeline process to store data.
+
+In some cases the database is further divided into subsections, for situations such as an extremely large database that is best run in smaller sections, or where different types of databases are retrieved using identical methods but produce slightly different results depending on input. Subsections are listed in the `config.json` file and will cause created data files to be placed in a data folder within a subsection folder. For the previously mentioned `nucleotide` database within the `ncbi` location, there are many subsections such as `invertebrate`, each of which allows part of the entire database to be processed on its own.
+
+To reference these locations/databases/subsections, use the chain (location)-(database)-(subsection). Continuing with the previous example, to access the `invertebrate` subsection of the ncbi's nucleotide database, you would refer to the source as `ncbi-nucleotide-invertebrate`. Additionally, you may omit the last section of any reference to attempt to process all referenced sections. This means that by referring to the nucleotide source as `ncbi-nucleotide` and omitting the subsection, the pipeline will attempt to process ALL subsections; if a database has no subsections, this is also how you refer to it. Taking this a step further, you can reference just `ncbi` when using the available source-related tools to attempt to process ALL databases and all of their subsections.
+
+## Tools
+All interaction with this pipeline is best done through the available command line tools. Many of these tools use the syntax mentioned above to interact with specific sources, although some tools have their own syntax, which you can discover by using the help flag (-h) when running any of the tools. A brief summary of the tools is:
+- newSource: Create a new source in the data sources folder, creating a barebones config file depending on the type of database provided.
+- listSources: Print a list of currently available locations/databases/subsections.
+- purgeSource: Remove a source from the data sources folder and clean it up.
+- download: Run the download process outlined in the config file.
+- process: Run the processing process outlined in the config file.
+- convert: Run the conversion process to remap and restructure data for ingestion.
+- package: Package up the converted file and the downloading/processing/converting metadata into a zip file.
+- update: Run download/process/convert/package sequentially, limited by the update information in the config.
+- samplePreConversion: Collect a sample of the file that is to be converted.
+- sampleConversion: Collect a sample of the converted file.
+
+## Data Storage Redirection
+A global `config.toml` file exists in the base directory for general global settings. The overwrites section can be replicated within any level of the data sources folder, and the deeper `config.json` files will use that config. For example, many of the databases are quite large, so modifying the `storage` overwrite allows all downloading/processing/conversion files to be placed in a new location. To do this for all databases within the `ncbi` location, a `config.toml` file can be created within the `ncbi` folder containing the overwrites section of the global config with either a relative or absolute path defined as the value, and all the databases will respect it. If instead you only wanted the `nucleotide` database to put its data in a different location, you could place a `config.toml` file within the `nucleotide` folder. This would allow all other `ncbi` databases to place their data as normal (whatever is outlined in the global `config.toml`), but have the `nucleotide` database place its downloading/processing/conversion data in a separate location.
 
 ## Issues repository
 - [List of issues](https://github.com/ARGA-Genomes/arga-data/issues)
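As context for the (location)-(database)-(subsection) reference syntax described in the updated README, the snippet below is a hypothetical illustration only (not code from this commit or repository) of how a reference string such as `ncbi-nucleotide-invertebrate` could be resolved to a folder under dataSources, with omitted trailing parts widening the scope to every database or subsection beneath that level.

from pathlib import Path

def resolveSource(reference: str, dataSourcesDir: Path = Path("./dataSources")) -> Path:
    # Hypothetical helper: split "location-database-subsection" into its parts.
    # Any omitted trailing part means "process everything below this level".
    parts = reference.split("-")
    location, database, subsection = (parts + [None, None, None])[:3]

    path = dataSourcesDir / location
    if database is not None:
        path = path / database
    if subsection is not None:
        path = path / subsection
    return path

# resolveSource("ncbi-nucleotide-invertebrate") -> dataSources/ncbi/nucleotide/invertebrate
# resolveSource("ncbi-nucleotide") -> dataSources/ncbi/nucleotide (all subsections)
# resolveSource("ncbi") -> dataSources/ncbi (all databases and their subsections)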

config.toml

Lines changed: 11 additions & 5 deletions
@@ -1,9 +1,15 @@
 [folders]
-src = "./src" # Source folder for all python code
-dataSources = "./dataSources" # Location of all source related files
-logs = "./logs" # Location of all logging files
-package = "" # Location for packaged files to be put in, leave blank to leave in respective dataSource location
+src = "./src" # Source folder for all python code, cannot overwrite with local config files
+dataSources = "./dataSources" # Location of all source related files, cannot overwrite with local config files
+logs = "./logs" # Location of all logging files, cannot overwrite with local config files
+
+[overwrites]
+storage = "" # Location overwrite for source data including downloading/processing/conversion, leave blank to keep in respective dataSource location, new location will have dataSources folder structure
+package = "" # Location overwrite for packaged files to be put in, leave blank to leave in respective dataSource location
+
+[files]
+secrets = "./secrets.toml"
 
 [settings]
 logToConsole = true
-logLevel = "info" # Log levels: debug, info, warning, error, critical
+logLevel = "info" # Log levels: debug, info, warning, error, critical
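The new [overwrites] section pairs with the README's Data Storage Redirection notes: a config.toml placed deeper inside dataSources can replace these values for everything beneath it. The snippet below is a minimal sketch of that lookup, assuming Python 3.11's tomllib and a hypothetical loadOverwrites helper; it is not the pipeline's actual config-handling code.

import tomllib
from pathlib import Path

def loadOverwrites(sourceDir: Path, globalConfig: Path = Path("./config.toml")) -> dict:
    # Start from the global [overwrites] section.
    with globalConfig.open("rb") as fp:
        overwrites = tomllib.load(fp).get("overwrites", {})

    # Walk from the top of the tree down to the source folder; any local
    # config.toml found along the way replaces the keys it defines, so the
    # deepest file wins.
    for folder in [*reversed(sourceDir.parents), sourceDir]:
        localConfig = folder / "config.toml"
        if folder == globalConfig.parent or not localConfig.is_file():
            continue
        with localConfig.open("rb") as fp:
            overwrites |= tomllib.load(fp).get("overwrites", {})

    return overwrites

# loadOverwrites(Path("./dataSources/ncbi/nucleotide")) would apply, in order,
# dataSources/config.toml, dataSources/ncbi/config.toml and
# dataSources/ncbi/nucleotide/config.toml over the global values, if they exist.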

dataSources/afd/checklist/processing.py renamed to dataSources/afd/checklist/scripts/processing.py

Lines changed: 12 additions & 17 deletions
@@ -3,9 +3,9 @@
 from pathlib import Path
 import pandas as pd
 from io import BytesIO
-from lib.bigFileWriter import BigFileWriter, Format
+from lib.bigFiles import DFWriter
 from bs4 import BeautifulSoup
-from lib.progressBar import SteppableProgressBar
+from lib.progressBar import ProgressBar
 import re
 import traceback
 
@@ -27,7 +27,7 @@ def __init__(self, rawData: dict):
 self.children = [EntryData(child) for child in rawData.get("children", [])]
 
 def retrieve(outputFilePath: Path):
-writer = BigFileWriter(outputFilePath, "sections", "section")
+writer = DFWriter(outputFilePath)
 
 checklist = "https://biodiversity.org.au/afd/mainchecklist"
 response = requests.get(checklist).text
@@ -38,17 +38,17 @@ def retrieve(outputFilePath: Path):
 kingdomData = [EntryData(kingdom) for kingdom in json.loads(response[start:end])]
 downloadChildCSVs(kingdomData, writer, [])
 
-writer.oneFile(False)
+writer.combine()
 
-def downloadChildCSVs(entryData: list[EntryData], writer: BigFileWriter, parentRanks: list[str]) -> None:
+def downloadChildCSVs(entryData: list[EntryData], writer: DFWriter, parentRanks: list[str]) -> None:
 for entry in entryData:
 content = getCSVData(entry.key)
 higherTaxonomy = parentRanks + [entry.rank]
 if content is not None:
 df = buildDF(content)
 # df["higher_taxonomy"] = ";".join(higherTaxonomy)
-writer.writeDF(df)
-print(f"Wrote file #{len(writer.writtenFiles)}", end="\r")
+writer.write(df)
+print(f"Wrote file #{writer.writtenFileCount()}", end="\r")
 continue
 
 # Content was too large to download
@@ -192,14 +192,10 @@ def enrich(filePath: Path, outputFilePath: Path) -> None:
 
 enrichmentPath = outputFilePath.parent / f"{rank}.csv"
 if not enrichmentPath.exists():
-writer = BigFileWriter(enrichmentPath, rank, subfileType=Format.CSV)
-writer.populateFromFolder(writer.subfileDir)
-subfileNames = [file.fileName for file in writer.writtenFiles]
-
-uniqueSeries = subDF["taxon_id"].unique()
-uniqueSeries = [item for item in uniqueSeries if item not in subfileNames]
+writer = DFWriter(enrichmentPath)
+uniqueSeries = subDF["taxon_id"].unique()[writer.writtenFileCount():] # Do not repeat already completed chunks
 
-bar = SteppableProgressBar(50, len(uniqueSeries), f"{rank} Progress")
+bar = ProgressBar(len(uniqueSeries), f"{rank} Progress")
 for taxonID in uniqueSeries:
 bar.update()
 
@@ -211,10 +207,9 @@ def enrich(filePath: Path, outputFilePath: Path) -> None:
 print(traceback.format_exc())
 return
 
-recordDF = pd.DataFrame.from_records(records)
-writer.writeDF(recordDF, taxonID)
+writer.write(pd.DataFrame.from_records(records), taxonID)
 
-writer.oneFile(False)
+writer.combine()
 
 enrichmentDF = pd.read_csv(enrichmentPath, dtype=object)
 df = df.merge(enrichmentDF, "left", left_on=["taxon_id", "canonical_name"], right_on=["taxon_id", rank.lower()])
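This file's changes swap lib.bigFileWriter.BigFileWriter for lib.bigFiles.DFWriter. Judging only from the calls visible in the diff (write, writtenFileCount, combine), the new writer appears to accumulate chunk files and merge them at the end; the class below is a rough sketch of that interface for orientation, not the real lib.bigFiles implementation.

import pandas as pd
from pathlib import Path

class DFWriter:
    # Sketch of the interface implied by the calls in the diff above;
    # the real lib.bigFiles.DFWriter may differ.
    def __init__(self, outputFilePath: Path):
        self.outputFilePath = outputFilePath
        self.chunkDir = outputFilePath.parent / f"{outputFilePath.stem}_chunks"
        self.chunkDir.mkdir(parents=True, exist_ok=True)

    def writtenFileCount(self) -> int:
        # Number of chunk files already on disk, used by callers to resume work.
        return len(list(self.chunkDir.glob("*.csv")))

    def write(self, df: pd.DataFrame, name: str | None = None) -> None:
        # Each call writes one chunk file, optionally named after its taxon ID.
        name = name if name is not None else str(self.writtenFileCount())
        df.to_csv(self.chunkDir / f"{name}.csv", index=False)

    def combine(self) -> None:
        # Merge all chunks into the single output file.
        frames = [pd.read_csv(path, dtype=object) for path in sorted(self.chunkDir.glob("*.csv"))]
        pd.concat(frames, ignore_index=True).to_csv(self.outputFilePath, index=False)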

dataSources/ala/avh/config.json

Lines changed: 0 additions & 21 deletions
This file was deleted.

dataSources/ala/avh/processing.py

Lines changed: 0 additions & 26 deletions
This file was deleted.
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+{
+    "datasetID": "0007001",
+    "retrieveType": "script",
+    "downloading": {
+        "path": "./processing.py",
+        "function": "collectBiocache",
+        "args": [
+            {
+                "q": "*:*",
+                "fq": "(basis_of_record:\"PRESERVED_SPECIMEN\" OR basis_of_record:\"MATERIAL_SAMPLE\" OR basis_of_record:\"LIVING_SPECIMEN\" OR basis_of_record:\"MATERIAL_CITATION\")",
+                "disableAllQualityFilters": true,
+                "qualityProfile": "ALA",
+                "qc": "-_nest_parent_:*"
+            },
+            "{OUT-PATH}"
+        ],
+        "output": "biocache.zip"
+    },
+    "processing": {
+        "linear": [
+            {
+                "path": ".lib/zipping.py",
+                "function": "extract",
+                "args": [
+                    "{IN-PATH}",
+                    "{OUT-DIR}"
+                ],
+                "output": "biocache"
+            },
+            {
+                "path": "./processing.py",
+                "function": "cleanup",
+                "args": [
+                    "{IN-PATH}",
+                    "{OUT-PATH}"
+                ],
+                "output": "biocache.csv"
+            }
+        ]
+    },
+    "conversion": {
+        "mapColumnName": "ala-biocache mappings"
+    },
+    "update": {
+        "type": "weekly",
+        "day": "sunday",
+        "time": 9,
+        "repeat": 2
+    }
+}
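The new config above (its file path is not shown in this capture) drives a scripted download followed by linear processing steps, each naming a script path, a function, and args containing {IN-PATH}/{OUT-PATH}/{OUT-DIR} placeholders. The pipeline code that dispatches these steps is not part of this diff; the sketch below is a hypothetical illustration of how one step could be executed, with runStep and its placeholder handling assumed rather than taken from the repository.

import importlib.util
from pathlib import Path

def runStep(step: dict, sourceDir: Path, inPath: Path | None) -> Path:
    # Hypothetical dispatcher for a single config step; not the pipeline's real code.
    outPath = sourceDir / "data" / step["output"]
    substitutions = {"{IN-PATH}": inPath, "{OUT-PATH}": outPath, "{OUT-DIR}": outPath.parent}
    args = [substitutions.get(arg, arg) if isinstance(arg, str) else arg for arg in step["args"]]

    # Load the script named by "path" relative to the source folder and call "function".
    scriptPath = (sourceDir / step["path"]).resolve()
    spec = importlib.util.spec_from_file_location(scriptPath.stem, scriptPath)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    getattr(module, step["function"])(*args)
    return outPath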
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+import requests
+from pathlib import Path
+import lib.downloading as dl
+import logging
+from lib.secrets import secrets
+import lib.bigFiles as bf
+
+def collectBiocache(queryParamters: dict, outputFilePath: Path) -> None:
+    paramters = {
+        "email": secrets.general.email,
+        "emailNotify": False
+    }
+
+    baseURL = "https://api.ala.org.au/occurrences/occurrences/offline/download?"
+    url = dl.urlBuilder(baseURL, paramters | queryParamters)
+
+    response = requests.get(url)
+    data = response.json()
+
+    statusURL = data["statusUrl"]
+    totalRecords = data["totalRecords"]
+    logging.info(f"Found {totalRecords} total records")
+
+    dl.asyncRunner(statusURL, "status", "finished", "downloadUrl", outputFilePath)
+
+def cleanup(folderPath: Path, outputFilePath: Path) -> None:
+    extraFiles = [
+        "citation.csv",
+        "headings.csv",
+        "README.html"
+    ]
+
+    for fileName in extraFiles:
+        path = folderPath / fileName
+        path.unlink(missing_ok=True)
+
+    bf.combineDirectoryFiles(outputFilePath, folderPath)
+
+# status = {
+#     "inQueue": [
+#         "totalRecords",
+#         "queueSize",
+#         "statusUrl",
+#         "cancelUrl",
+#         "searchUrl"
+#     ],
+#     "running": [
+#         "totalRecords",
+#         "records",
+#         "statusUrl",
+#         "cancelUrl",
+#         "searchUrl"
+#     ],
+#     "finished": [
+#         "totalRecords",
+#         "queueSize",
+#         "downloadUrl",
+#         "statusUrl",
+#         "cancelUrl",
+#         "searchUrl"
+#     ]
+# }
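collectBiocache hands the ALA offline-download job off to dl.asyncRunner. That helper's implementation is not part of this diff; going by its arguments and the commented status payloads above, it presumably polls the status URL until the named field reads "finished" and then fetches the download URL. The snippet below is a sketch of that assumed behaviour only, not the real lib.downloading helper.

import time
import requests
from pathlib import Path

def pollAndDownload(statusURL: str, statusKey: str, doneValue: str, linkKey: str,
                    outputFilePath: Path, delay: float = 30.0) -> None:
    # Assumed behaviour of dl.asyncRunner, inferred from how it is called above.
    while True:
        status = requests.get(statusURL).json()
        if status.get(statusKey) == doneValue:
            break
        time.sleep(delay)

    # Once the job reports finished, stream the prepared archive to disk.
    with requests.get(status[linkKey], stream=True) as response:
        response.raise_for_status()
        with outputFilePath.open("wb") as fp:
            for chunk in response.iter_content(chunk_size=1 << 20):
                fp.write(chunk)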
File renamed without changes.
