
Commit 7b03c60

Merge pull request #234 from ARGA-Genomes/develop
Develop Merge
2 parents d40eb25 + b9373d5 commit 7b03c60

111 files changed, +2417 -2320 lines changed


.gitignore

Lines changed: 2 additions & 0 deletions
@@ -35,6 +35,7 @@ dataSources/**/data
 dataSources/**/examples
 dataSources/**/crawlerProgress
 dataSources/**/*.json
+dataSources/**/*.toml
 !dataSources/**/config.json
 
 # Data Files #
@@ -52,6 +53,7 @@ dataSources/**/*.json
 .env
 *.DS_Store
 *.code-workspace
+secrets.toml
 
 # Cached Files #
 *__pycache__

README.md

Lines changed: 25 additions & 22 deletions
@@ -2,30 +2,33 @@
 ARGA will allow researchers to easily discover and access genomic data for Australian species in one place, facilitating research and informing decision making in areas including conservation, biosecurity and agriculture.
 
 ## ARGA Data
-This repo is for Python and related code for data ingestion and pre-ingestion munging, prior to loading DwCA data into the Pipelines workflow.
+This repo is for Python and related code for data ingestion and pre-ingestion munging, prior to loading data into arga-backend.
 
 ## Setup
-Set up can be initiated by being in the base directory and running `deploy.py`, which should create a virtual environment and add a link to the src folder required to run the scripts. Further are provided as part of that script after it completes.
-
-## Processing Data
-The configuration files in `dataSources/location/database` are how processing understands the procedure. To add a new source, you'll need to create a new sourceConfig file. To access that source, you can call one of the source processing tools with the syntax `location-database`. For example, to process the genbankSummary data source which is part of the ncbi, your source is called `ncbi-genbankSummary`, which is case sensitive.
-
-The tools currently available are:
-- listSources.py
-- newSource.py
-- purgeSource.py
-- download.py
-- process.py
-- convert.py
-- getFields.py
-
-`listSources` shows you a list of currently available sources.
-`newSource` creates a new source folder and basic config to kickstart adding a new source.
-`purgeSource` deletes a source that is no longer required.
-`download` is for simply downloading the data, which will vary based on your database type.
-`processing` is for running initial processing on files outlined in the source config.
-`convert` will then convert the old file to the new file mappings, as well as enrich and augment if required.
-`getFields` will read the pre-conversion file and give examples of field names and how they'll be mapped with the appropriate mapping file.
+Setup is initiated by running `deploy.py` from the base directory, which should create a virtual environment and add a link to the src folder required to run the scripts. Further help is provided as part of that script after it completes. This process is platform independent.
+
+## Data Sources
+This pipeline works with config files in the `dataSources` folder of the main directory. At the top level, folders exist for each of the currently available data locations, such as `ncbi`. Within each location folder are database folders, which further divide the location into separate databases. For example, the `ncbi` folder mentioned previously has a subfolder named `nucleotide`, which is the nucleotide data available from the ncbi data location. At this level a config file outlines how the dataset should be run to produce a mapped output file ready for ingestion; a scripts folder stores database-specific scripts, and a data folder will be created during the pipeline process to store data.
+
+In some cases the database is further divided into subsections, for situations such as an extremely large database that is best run in smaller sections, or where different types of databases are retrieved using identical methods but produce slightly different results depending on input. Subsections are listed in the `config.json` file and will cause created data files to be placed in a data folder within a subsection folder. For the previously mentioned `nucleotide` database within the `ncbi` location, there are many subsections such as `invertebrate`, each of which allows part of the entire database to be processed on its own.
+
+To reference these locations/databases/subsections, use the chain (location)-(database)-(subsection). Continuing with the previous example, to access the `invertebrate` subsection of the ncbi's nucleotide database, you would refer to the source as `ncbi-nucleotide-invertebrate`. Additionally, you may omit the last section of any reference to attempt to process all referenced sections. This means that by referring to the nucleotide source as `ncbi-nucleotide` and omitting the subsection, the pipeline will attempt to process ALL subsections; if a database has no subsections, this is also how you refer to it. Taking this a step further, you can reference just `ncbi` when using the available source-related tools to attempt to process ALL databases and all of their subsections.
+
+## Tools
+All interaction with this pipeline is best done through the available command line tools. Many of these tools use the syntax mentioned above to interact with specific sources, although some tools have their own syntax, which you can discover by using the help flag (-h) when running any of the tools. A brief summary of the tools is:
+- newSource: Create a new source in the data sources folder, creating a barebones config file depending on the type of database provided.
+- listSources: Print a list of currently available locations/databases/subsections.
+- purgeSource: Remove a source from the data sources folder and clean it up.
+- download: Run the download process outlined in the config file.
+- process: Run the processing process outlined in the config file.
+- convert: Run the conversion process to remap and restructure data for ingestion.
+- package: Package up the converted file and the downloading/processing/converting metadata into a zip file.
+- update: Run download/process/convert/package sequentially, limited by the update information in the config.
+- samplePreConversion: Collect a sample of the file that is to be converted.
+- sampleConversion: Collect a sample of the converted file.
+
+## Data Storage Redirection
+A global `config.toml` file exists in the base directory for general global settings. The overwrites section can be replicated within any level of the data sources folder, and the deeper `config.json` files will use that config. For example, many of the databases are quite large, so modifying the `storage` overwrite allows all downloading/processing/conversion files to be placed in a new location. To do this for all databases within the `ncbi` location, a `config.toml` file can be created within the `ncbi` folder containing the overwrites section of the global config with either a relative or absolute path defined as the value, and all the databases will respect it. If instead you only wanted the `nucleotide` database to put its data in a different location, you could place a `config.toml` file within the `nucleotide` folder. This would allow all other `ncbi` databases to place their data as normal (whatever is outlined in the global `config.toml`), but have the `nucleotide` database place its downloading/processing/conversion data in a separate location.
 
 ## Issues repository
 - [List of issues](https://github.com/ARGA-Genomes/arga-data/issues)
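As context for the (location)-(database)-(subsection) reference syntax described in the updated README, the snippet below is a hypothetical illustration only (not code from this commit or repository) of how a reference string such as `ncbi-nucleotide-invertebrate` could be resolved to a folder under dataSources, with omitted trailing parts widening the scope to every database or subsection beneath that level.

from pathlib import Path

def resolveSource(reference: str, dataSourcesDir: Path = Path("./dataSources")) -> Path:
    # Hypothetical helper: split "location-database-subsection" into its parts.
    # Any omitted trailing part means "process everything below this level".
    parts = reference.split("-")
    location, database, subsection = (parts + [None, None, None])[:3]

    path = dataSourcesDir / location
    if database is not None:
        path = path / database
    if subsection is not None:
        path = path / subsection
    return path

# resolveSource("ncbi-nucleotide-invertebrate") -> dataSources/ncbi/nucleotide/invertebrate
# resolveSource("ncbi-nucleotide") -> dataSources/ncbi/nucleotide (all subsections)
# resolveSource("ncbi") -> dataSources/ncbi (all databases and their subsections)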

config.toml

Lines changed: 11 additions & 5 deletions
@@ -1,9 +1,15 @@
 [folders]
-src = "./src" # Source folder for all python code
-dataSources = "./dataSources" # Location of all source related files
-logs = "./logs" # Location of all logging files
-package = "" # Location for packaged files to be put in, leave blank to leave in respective dataSource location
+src = "./src" # Source folder for all python code, cannot overwrite with local config files
+dataSources = "./dataSources" # Location of all source related files, cannot overwrite with local config files
+logs = "./logs" # Location of all logging files, cannot overwrite with local config files
+
+[overwrites]
+storage = "" # Location overwrite for source data including downloading/processing/conversion, leave blank to keep in respective dataSource location, new location will have dataSources folder structure
+package = "" # Location overwrite for packaged files to be put in, leave blank to leave in respective dataSource location
+
+[files]
+secrets = "./secrets.toml"
 
 [settings]
 logToConsole = true
-logLevel = "info" # Log levels: debug, info, warning, error, critical
+logLevel = "info" # Log levels: debug, info, warning, error, critical
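The new [overwrites] section pairs with the README's Data Storage Redirection notes: a config.toml placed deeper inside dataSources can replace these values for everything beneath it. The snippet below is a minimal sketch of that lookup, assuming Python 3.11's tomllib and a hypothetical loadOverwrites helper; it is not the pipeline's actual config-handling code.

import tomllib
from pathlib import Path

def loadOverwrites(sourceDir: Path, globalConfig: Path = Path("./config.toml")) -> dict:
    # Start from the global [overwrites] section.
    with globalConfig.open("rb") as fp:
        overwrites = tomllib.load(fp).get("overwrites", {})

    # Walk from the top of the tree down to the source folder; any local
    # config.toml found along the way replaces the keys it defines, so the
    # deepest file wins.
    for folder in [*reversed(sourceDir.parents), sourceDir]:
        localConfig = folder / "config.toml"
        if folder == globalConfig.parent or not localConfig.is_file():
            continue
        with localConfig.open("rb") as fp:
            overwrites |= tomllib.load(fp).get("overwrites", {})

    return overwrites

# loadOverwrites(Path("./dataSources/ncbi/nucleotide")) would apply, in order,
# dataSources/config.toml, dataSources/ncbi/config.toml and
# dataSources/ncbi/nucleotide/config.toml over the global values, if they exist.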

dataSources/afd/checklist/processing.py renamed to dataSources/afd/checklist/scripts/processing.py

Lines changed: 12 additions & 17 deletions
@@ -3,9 +3,9 @@
 from pathlib import Path
 import pandas as pd
 from io import BytesIO
-from lib.bigFileWriter import BigFileWriter, Format
+from lib.bigFiles import DFWriter
 from bs4 import BeautifulSoup
-from lib.progressBar import SteppableProgressBar
+from lib.progressBar import ProgressBar
 import re
 import traceback
 
@@ -27,7 +27,7 @@ def __init__(self, rawData: dict):
 self.children = [EntryData(child) for child in rawData.get("children", [])]
 
 def retrieve(outputFilePath: Path):
-writer = BigFileWriter(outputFilePath, "sections", "section")
+writer = DFWriter(outputFilePath)
 
 checklist = "https://biodiversity.org.au/afd/mainchecklist"
 response = requests.get(checklist).text
@@ -38,17 +38,17 @@ def retrieve(outputFilePath: Path):
 kingdomData = [EntryData(kingdom) for kingdom in json.loads(response[start:end])]
 downloadChildCSVs(kingdomData, writer, [])
 
-writer.oneFile(False)
+writer.combine()
 
-def downloadChildCSVs(entryData: list[EntryData], writer: BigFileWriter, parentRanks: list[str]) -> None:
+def downloadChildCSVs(entryData: list[EntryData], writer: DFWriter, parentRanks: list[str]) -> None:
 for entry in entryData:
 content = getCSVData(entry.key)
 higherTaxonomy = parentRanks + [entry.rank]
 if content is not None:
 df = buildDF(content)
 # df["higher_taxonomy"] = ";".join(higherTaxonomy)
-writer.writeDF(df)
-print(f"Wrote file #{len(writer.writtenFiles)}", end="\r")
+writer.write(df)
+print(f"Wrote file #{writer.writtenFileCount()}", end="\r")
 continue
 
 # Content was too large to download
@@ -192,14 +192,10 @@ def enrich(filePath: Path, outputFilePath: Path) -> None:
 
 enrichmentPath = outputFilePath.parent / f"{rank}.csv"
 if not enrichmentPath.exists():
-writer = BigFileWriter(enrichmentPath, rank, subfileType=Format.CSV)
-writer.populateFromFolder(writer.subfileDir)
-subfileNames = [file.fileName for file in writer.writtenFiles]
-
-uniqueSeries = subDF["taxon_id"].unique()
-uniqueSeries = [item for item in uniqueSeries if item not in subfileNames]
+writer = DFWriter(enrichmentPath)
+uniqueSeries = subDF["taxon_id"].unique()[writer.writtenFileCount():] # Do not repeat already completed chunks
 
-bar = SteppableProgressBar(50, len(uniqueSeries), f"{rank} Progress")
+bar = ProgressBar(len(uniqueSeries), f"{rank} Progress")
 for taxonID in uniqueSeries:
 bar.update()
 
@@ -211,10 +207,9 @@ def enrich(filePath: Path, outputFilePath: Path) -> None:
 print(traceback.format_exc())
 return
 
-recordDF = pd.DataFrame.from_records(records)
-writer.writeDF(recordDF, taxonID)
+writer.write(pd.DataFrame.from_records(records), taxonID)
 
-writer.oneFile(False)
+writer.combine()
 
 enrichmentDF = pd.read_csv(enrichmentPath, dtype=object)
 df = df.merge(enrichmentDF, "left", left_on=["taxon_id", "canonical_name"], right_on=["taxon_id", rank.lower()])
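This file's changes swap lib.bigFileWriter.BigFileWriter for lib.bigFiles.DFWriter. Judging only from the calls visible in the diff (write, writtenFileCount, combine), the new writer appears to accumulate chunk files and merge them at the end; the class below is a rough sketch of that interface for orientation, not the real lib.bigFiles implementation.

import pandas as pd
from pathlib import Path

class DFWriter:
    # Sketch of the interface implied by the calls in the diff above;
    # the real lib.bigFiles.DFWriter may differ.
    def __init__(self, outputFilePath: Path):
        self.outputFilePath = outputFilePath
        self.chunkDir = outputFilePath.parent / f"{outputFilePath.stem}_chunks"
        self.chunkDir.mkdir(parents=True, exist_ok=True)

    def writtenFileCount(self) -> int:
        # Number of chunk files already on disk, used by callers to resume work.
        return len(list(self.chunkDir.glob("*.csv")))

    def write(self, df: pd.DataFrame, name: str | None = None) -> None:
        # Each call writes one chunk file, optionally named after its taxon ID.
        name = name if name is not None else str(self.writtenFileCount())
        df.to_csv(self.chunkDir / f"{name}.csv", index=False)

    def combine(self) -> None:
        # Merge all chunks into the single output file.
        frames = [pd.read_csv(path, dtype=object) for path in sorted(self.chunkDir.glob("*.csv"))]
        pd.concat(frames, ignore_index=True).to_csv(self.outputFilePath, index=False)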

dataSources/ala/avh/config.json

Lines changed: 0 additions & 21 deletions
This file was deleted.

dataSources/ala/avh/processing.py

Lines changed: 0 additions & 26 deletions
This file was deleted.
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
+{
+    "datasetID": "0007001",
+    "retrieveType": "script",
+    "downloading": {
+        "path": "./processing.py",
+        "function": "collectBiocache",
+        "args": [
+            {
+                "q": "*:*",
+                "fq": "(basis_of_record:\"PRESERVED_SPECIMEN\" OR basis_of_record:\"MATERIAL_SAMPLE\" OR basis_of_record:\"LIVING_SPECIMEN\" OR basis_of_record:\"MATERIAL_CITATION\")",
+                "disableAllQualityFilters": true,
+                "qualityProfile": "ALA",
+                "qc": "-_nest_parent_:*"
+            },
+            "{OUT-PATH}"
+        ],
+        "output": "biocache.zip"
+    },
+    "processing": {
+        "linear": [
+            {
+                "path": ".lib/zipping.py",
+                "function": "extract",
+                "args": [
+                    "{IN-PATH}",
+                    "{OUT-DIR}"
+                ],
+                "output": "biocache"
+            },
+            {
+                "path": "./processing.py",
+                "function": "cleanup",
+                "args": [
+                    "{IN-PATH}",
+                    "{OUT-PATH}"
+                ],
+                "output": "biocache.csv"
+            }
+        ]
+    },
+    "conversion": {
+        "mapColumnName": "ala-biocache mappings"
+    },
+    "update": {
+        "type": "weekly",
+        "day": "sunday",
+        "time": 9,
+        "repeat": 2
+    }
+}
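The new config above (its file path is not shown in this capture) drives a scripted download followed by linear processing steps, each naming a script path, a function, and args containing {IN-PATH}/{OUT-PATH}/{OUT-DIR} placeholders. The pipeline code that dispatches these steps is not part of this diff; the sketch below is a hypothetical illustration of how one step could be executed, with runStep and its placeholder handling assumed rather than taken from the repository.

import importlib.util
from pathlib import Path

def runStep(step: dict, sourceDir: Path, inPath: Path | None) -> Path:
    # Hypothetical dispatcher for a single config step; not the pipeline's real code.
    outPath = sourceDir / "data" / step["output"]
    substitutions = {"{IN-PATH}": inPath, "{OUT-PATH}": outPath, "{OUT-DIR}": outPath.parent}
    args = [substitutions.get(arg, arg) if isinstance(arg, str) else arg for arg in step["args"]]

    # Load the script named by "path" relative to the source folder and call "function".
    scriptPath = (sourceDir / step["path"]).resolve()
    spec = importlib.util.spec_from_file_location(scriptPath.stem, scriptPath)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    getattr(module, step["function"])(*args)
    return outPath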
Lines changed: 62 additions & 0 deletions
@@ -0,0 +1,62 @@
+import requests
+from pathlib import Path
+import lib.downloading as dl
+import logging
+from lib.secrets import secrets
+import lib.bigFiles as bf
+
+def collectBiocache(queryParamters: dict, outputFilePath: Path) -> None:
+    paramters = {
+        "email": secrets.general.email,
+        "emailNotify": False
+    }
+
+    baseURL = "https://api.ala.org.au/occurrences/occurrences/offline/download?"
+    url = dl.urlBuilder(baseURL, paramters | queryParamters)
+
+    response = requests.get(url)
+    data = response.json()
+
+    statusURL = data["statusUrl"]
+    totalRecords = data["totalRecords"]
+    logging.info(f"Found {totalRecords} total records")
+
+    dl.asyncRunner(statusURL, "status", "finished", "downloadUrl", outputFilePath)
+
+def cleanup(folderPath: Path, outputFilePath: Path) -> None:
+    extraFiles = [
+        "citation.csv",
+        "headings.csv",
+        "README.html"
+    ]
+
+    for fileName in extraFiles:
+        path = folderPath / fileName
+        path.unlink(missing_ok=True)
+
+    bf.combineDirectoryFiles(outputFilePath, folderPath)
+
+# status = {
+#     "inQueue": [
+#         "totalRecords",
+#         "queueSize",
+#         "statusUrl",
+#         "cancelUrl",
+#         "searchUrl"
+#     ],
+#     "running": [
+#         "totalRecords",
+#         "records",
+#         "statusUrl",
+#         "cancelUrl",
+#         "searchUrl"
+#     ],
+#     "finished": [
+#         "totalRecords",
+#         "queueSize",
+#         "downloadUrl",
+#         "statusUrl",
+#         "cancelUrl",
+#         "searchUrl"
+#     ]
+# }
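collectBiocache hands the ALA offline-download job off to dl.asyncRunner. That helper's implementation is not part of this diff; going by its arguments and the commented status payloads above, it presumably polls the status URL until the named field reads "finished" and then fetches the download URL. The snippet below is a sketch of that assumed behaviour only, not the real lib.downloading helper.

import time
import requests
from pathlib import Path

def pollAndDownload(statusURL: str, statusKey: str, doneValue: str, linkKey: str,
                    outputFilePath: Path, delay: float = 30.0) -> None:
    # Assumed behaviour of dl.asyncRunner, inferred from how it is called above.
    while True:
        status = requests.get(statusURL).json()
        if status.get(statusKey) == doneValue:
            break
        time.sleep(delay)

    # Once the job reports finished, stream the prepared archive to disk.
    with requests.get(status[linkKey], stream=True) as response:
        response.raise_for_status()
        with outputFilePath.open("wb") as fp:
            for chunk in response.iter_content(chunk_size=1 << 20):
                fp.write(chunk)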
File renamed without changes.
