Releases: bio-guoda/preston
0.9.0
Features
- introduce index for lookup of content associated with anchored provenance graph for faster streaming of Zenodo metadata from Zotero and RIS records.
- enable translation of RIS metadata into Zenodo metadata to help support creation of BHL corpora #297
For example usage, see https://github.com/jhpoelen/bhl-corpus-tracker or the commands below:
# first track BHL item.txt and RIS metadata
preston track --algo md5 \
  "https://biodiversitylibrary.org/data/part.txt" \
  "https://www.biodiversitylibrary.org/data/RIS/bhlpart.ris.zip"
# then track an associated pdf
preston track --algo md5 \
  https://www.biodiversitylibrary.org/partpdf/1
# then generate associated Zenodo metadata using
preston ls --algo md5 \
  | preston ris-stream
Improvements
- TAXODROS detect year pattern in the publication year field to avoid extra characters like commas. TaxoDros/TaxoDros.github.io#41 @myrmoteras
- MfN initial support for listing images embedded in Excel spreadsheets; related to darktaxon/darktaxon#7 @myrmoteras @asrivathsan
- MfN towards capturing links between stacked images and their raw image origins; related to darktaxon/darktaxon#5 @myrmoteras @asrivathsan
- MfN towards supporting thumbnail generation from TIFF images; related to #217 @myrmoteras @asrivathsan
- add default author for RIS records without stated authors; related to #299 fyi @myrmoteras @tcatapano
Bugs
n/a
0.8.6
Features
- enable translation of Zotero metadata into Zenodo metadata; related to #286
- enable pointing to individual Zotero records; related to #287
Improvements
- allow for streaming composite content ids via [preston cat]; related to #288
- do not resolve malformed or non-file URIs; related to #291
- make version pattern matcher a little more specific; related to #292
- when tracking GitHub issues, only track associated GitHub assets (e.g., images) and linked files. #295
- track meta.xml for BHL items; related to #296
Bugs
n/a
0.8.5
Features
- towards tracking of Zotero literature groups; related to #281 @myrmoteras @ajacsherman
example usage:
ZOTERO_TOKEN=[SECRET] preston track https://www.zotero.org/groups/5435545/bat_literature_project
For more usage examples, see [1], https://github.com/bat-literature/bat-literature.github.io and https://bat-literature.github.io.
Example usage to track a Google Doc and copy its associated PDF, with provenance data stored in the data/ folder:
preston track "https://docs.google.com/document/d/1LMnC0lUw_DGIQV5Pa4lZhe_7-SIR-otgHSHDNkAwD7Q/edit" \
  | grep pdf \
  | grep hasVersion \
  | preston cat \
  > doc.pdf
Improvements
- implement workaround for inconsistent TaxonWorks API endpoint SpeciesFileGroup/taxonworks#3940 @mjy. Also see [2].
Bugs
n/a
References
[1] Sherman AC et al. (2024) Bat Literature Corpus v0.1. https://github.com/bat-literature/bat-literature.github.io https://bat-literature.github.io https://linker.bio/hash://sha256/6ba3d79cf1fd6349012cb4e527b6727b3e41e140489fa9c02f132e2cdd88d189
[2] Poelen, J. H. (2024). A biodiversity dataset graph: Biological Associations in TaxonWorks hash://sha256/e4a47c067d6c125da60c9a1b92b5eecdea539cb8666cd3aed99db347ae5b8ed0 hash://md5/686007de79cc2a49ab23fd3debe56e3f (0.3) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.11151783
0.8.4
0.8.3
Features
n/a
Improvements
- improved support for streaming TaxoDros records in jsonl for Zenodo publication #275; also related to TaxoDros/TaxoDros.github.io#18 fyi @myrmoteras @slint. Added Zenodo keywords and biodiversity-related custom terms: note the "keywords" and "custom" fields in the example metadata record for a TaxoDros item shown below.
{
"metadata": {
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "taxodros-dros5",
"keywords": [
"Biodiversity",
"Taxonomy",
"fruit flies",
"flies",
"Animalia",
"Arthropoda",
"Insecta",
"Diptera"
],
"custom": {
"dwc:kingdom": [
"Animalia"
],
"dwc:phylum": [
"Arthropoda"
],
"dwc:class": [
"Insecta"
],
"dwc:order": [
"Diptera"
]
},
"referenceId": "abd el-halim et al., 2005",
"related_identifiers": [
{
"relation": "isAlternateIdentifier",
"identifier": "urn:lsid:taxodros.uzh.ch:id:abd%20el-halim%20et%20al.%2C%202005"
},
{
"relation": "isDerivedFrom",
"identifier": "https://linker.bio/line:hash://md5/ff86b940567d278e50fa00672cf96629!/L1-L10"
},
{
"relation": "isDerivedFrom",
"identifier": "10.5281/zenodo.10723540"
},
{
"relation": "isPartOf",
"identifier": "https://www.taxodros.uzh.ch"
},
{
"relation": "isAlternateIdentifier",
"identifier": "hash://md5/639988a4074ded5208a575b760a5dc5e"
}
],
"creators": [
{
"name": "Abd El-Halim, A.S."
},
{
"name": "Mostafa, A.A."
},
{
"name": "Allam, K.A.M.a."
}
],
"access_right": "restricted",
"publication_date": "2005",
"title": "Dipterous flies species and their densities in fourteen Egyptian governorates.",
"publication_type": "article",
"journal_title": "Journal of the Egyptian Society of Parasitology",
"journal_volume": "35",
"journal_pages": "351-362",
"taxodros:method": "ocr",
"http://www.w3.org/ns/prov#wasDerivedFrom": "line:hash://md5/ff86b940567d278e50fa00672cf96629!/L1-L10",
"references": [
"Bächli, G. (2024). TaxoDros - The Database on Taxonomy of Drosophilidae hash://md5/26a67012dde325cf2a3a058cc2f9c1b8 hash://sha256/ca86d74b318a334bddbc7c6a387a09530a083b8617718f5369ad548744c602d3 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10723540"
],
"filename": "Abd El-Halim et al., 2005.pdf",
"upload_type": "publication",
"communities": [
{
"identifier": "taxodros"
},
{
"identifier": "biosyslit"
}
],
"description": "Uploaded by Plazi for TaxoDros. We do not have abstracts."
}
}
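The record above points back to its source lines with an identifier of the form line:hash://...!/L1-L10. As a minimal sketch, assuming the suffix after "!/" is a first-to-last line range (my reading of the notation, not a statement about how preston parses it), the identifier can be split with plain POSIX parameter expansion:

```shell
# Identifier copied from the record above.
uri='line:hash://md5/ff86b940567d278e50fa00672cf96629!/L1-L10'

# Strip the "line:" prefix, then everything from "!" onwards, to get the content id.
content_id=${uri#line:}
content_id=${content_id%%!*}

# Take the part after "!/" and split it into first and last line numbers.
# Assumption: the range always has the form L<first>-L<last>.
range=${uri##*!/}
first=${range%%-*}; first=${first#L}
last=${range##*-};  last=${last#L}

echo "content id: $content_id"
echo "lines: $first to $last"
```

The variable names are illustrative only; a single-line selector such as !/L25 would need extra handling that this sketch leaves out.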
Bugs
n/a
References
Bächli, G. (2024). TaxoDros - The Database on Taxonomy of Drosophilidae hash://md5/26a67012dde325cf2a3a058cc2f9c1b8 hash://sha256/ca86d74b318a334bddbc7c6a387a09530a083b8617718f5369ad548744c602d3 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10723540
0.8.2
Features
- added support for streaming metadata into Zenodo records, creating or updating them as needed. Note that the provided Zenodo JSON metadata should be in line-json: one JSON object per line.
Example Usage
cat metadata.json \
  | jq -c . \
  | preston track \
  | grep hasVersion \
  | preston zenodo \
    --endpoint https://sandbox.zenodo.org \
    --access-token [your access token]
Where `jq -c .` ensures that the JSON is in line-json, `preston track` versions the piped JSON, `grep hasVersion` selects only the tracked content, not its previous versions, and `preston zenodo` attempts to create or update Zenodo records extracted from the versioned content.
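To illustrate the filtering step, here is a minimal sketch that applies the same grep to two hand-written statements. The URLs, hashes and UUID are hypothetical placeholders, not real preston output; only the idea of keeping the statements that mention hasVersion comes from the description above.

```shell
# Hypothetical provenance statements (placeholders, not real preston output).
statements='<https://example.org/metadata.json> <http://purl.org/pav/hasVersion> <hash://sha256/aaaa> <urn:uuid:0000> .
<hash://sha256/bbbb> <http://purl.org/pav/previousVersion> <hash://sha256/cccc> <urn:uuid:0000> .'

# Keep only the statements that link a tracked resource to its content id.
versions=$(printf '%s\n' "$statements" | grep hasVersion)
printf '%s\n' "$versions"

# Count how many statements passed the filter.
count=$(printf '%s\n' "$versions" | wc -l | tr -d ' ')
echo "kept $count statement(s)"
```

Because grep matches on plain text, the filter does not need to parse the RDF; it relies on the predicate name appearing only in the statements of interest.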
Note that if metadata.json has an alternate identifier in it that is a content id, then that content will be included as a file. Also, the `preston zenodo` command emits RDF statements that describe the associated Zenodo record and its related identifiers.
Here is a sample snippet:
<urn:uuid:190939b5-5d59-45f4-a913-9666037eac8d> <http://purl.org/dc/terms/description> "An activity that creates or updates Zenodo records."@en <urn:uuid:190939b5-5d59-45f4-a913-9666037eac8d> .
<https://sandbox.zenodo.org/records/31836> <http://www.w3.org/ns/prov#wasDerivedFrom> <line:hash://sha256/cb94e7c16a617a56a55fbbd76c458333111053bc501d52ae34548b35967933b2!/L25> <urn:uuid:190939b5-5d59-45f4-a913-9666037eac8d> .
<https://sandbox.zenodo.org/records/31836> <http://www.w3.org/ns/prov#wasDerivedFrom> <https://linker.bio/line:hash://md5/ff86b940567d278e50fa00672cf96629!/L175241-L175251> <urn:uuid:190939b5-5d59-45f4-a913-9666037eac8d> .
<https://sandbox.zenodo.org/records/31836> <http://www.w3.org/ns/prov#wasDerivedFrom> <10.5281/zenodo.10593902> <urn:uuid:190939b5-5d59-45f4-a913-9666037eac8d> .
<https://sandbox.zenodo.org/records/31836> <http://www.w3.org/ns/prov#alternateOf> <hash://md5/96ee875c6d473e0095ccc6384fbebb1c> <urn:uuid:190939b5-5d59-45f4-a913-9666037eac8d> .
<https://sandbox.zenodo.org/records/31836> <http://www.w3.org/ns/prov#alternateOf> <urn:lsid:taxodros.uzh.ch:id:toda%2C%201985a> <urn:uuid:190939b5-5d59-45f4-a913-9666037eac8d> .
<https://sandbox.zenodo.org/records/31836> <http://purl.org/pav/lastRefreshedOn> "2024-02-28T20:56:56.969Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:190939b5-5d59-45f4-a913-9666037eac8d> .
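Statements like the ones above can be post-processed with standard text tools. The following sketch lists, for each alternateOf statement, the Zenodo record and its alternate identifier. It uses two statements copied from the sample and assumes whitespace-separated fields, which holds for these lines but is not a general RDF parser.

```shell
# Two statements copied from the sample output above.
quads='<https://sandbox.zenodo.org/records/31836> <http://www.w3.org/ns/prov#alternateOf> <hash://md5/96ee875c6d473e0095ccc6384fbebb1c> <urn:uuid:190939b5-5d59-45f4-a913-9666037eac8d> .
<https://sandbox.zenodo.org/records/31836> <http://purl.org/pav/lastRefreshedOn> "2024-02-28T20:56:56.969Z"^^<http://www.w3.org/2001/XMLSchema#dateTime> <urn:uuid:190939b5-5d59-45f4-a913-9666037eac8d> .'

# Print "record <tab> identifier" for every alternateOf statement.
# Assumption: subject and object contain no spaces, so awk fields 1 and 3 are safe.
pairs=$(printf '%s\n' "$quads" \
  | grep 'prov#alternateOf' \
  | awk '{ gsub(/[<>]/, "", $1); gsub(/[<>]/, "", $3); print $1 "\t" $3 }')
printf '%s\n' "$pairs"
```

In practice the input would come from the output of the zenodo step rather than from a shell variable.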
Example of a pretty-printed record from a metadata line-json file; note the content id hash://md5/639988a4074ded5208a575b760a5dc5e and "filename": "Abd El-Halim et al., 2005.pdf".
{
"metadata": {
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "taxodros-dros5",
"referenceId": "abd el-halim et al., 2005",
"related_identifiers": [
{
"relation": "isAlternateIdentifier",
"identifier": "urn:lsid:taxodros.uzh.ch:id:abd%20el-halim%20et%20al.%2C%202005"
},
{
"relation": "isDerivedFrom",
"identifier": "https://linker.bio/line:hash://md5/ff86b940567d278e50fa00672cf96629!/L1-L10"
},
{
"relation": "isDerivedFrom",
"identifier": "10.5281/zenodo.10593902"
},
{
"relation": "isPartOf",
"identifier": "https://www.taxodros.uzh.ch"
},
{
"relation": "isAlternateIdentifier",
"identifier": "hash://md5/639988a4074ded5208a575b760a5dc5e"
}
],
"creators": [
{
"name": "Abd El-Halim, A.S."
},
{
"name": "Mostafa, A.A."
},
{
"name": "Allam, K.A.M.a."
}
],
"access_right": "restricted",
"publication_date": "2005",
"title": "Dipterous flies species and their densities in fourteen Egyptian governorates.",
"publication_type": "article",
"journal_title": "Journal of the Egyptian Society of Parasitology",
"journal_volume": "35",
"journal_pages": "351-362",
"taxodros:method": "ocr",
"http://www.w3.org/ns/prov#wasDerivedFrom": "line:hash://md5/ff86b940567d278e50fa00672cf96629!/L1-L10",
"references": [
"Bächli, G. (2024). TaxoDros - The Database on Taxonomy of Drosophilidae hash://md5/4fa9eeed1c8cff2490483a48c718df02 hash://sha256/e05466f33c755f11bd1c2fa30eef2388bf24ff7989931bae1426daff0200af19 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10593902"
],
"filename": "Abd El-Halim et al., 2005.pdf",
"upload_type": "publication",
"communities": [
{
"identifier": "taxodros"
},
{
"identifier": "biosyslit"
}
],
"description": "Uploaded by Plazi for TaxoDros. We do not have abstracts."
}
}
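The record above is pretty-printed for readability, whereas the streaming command expects one JSON object per line. Where jq is not available, a rough fallback is to delete the newline characters. This is a sketch that assumes the file holds exactly one JSON object; it works because valid JSON cannot contain raw newlines inside strings, but unlike `jq -c .` it keeps the indentation spaces and does not validate the input.

```shell
# A small pretty-printed object, shortened from the record above.
pretty=$(mktemp)
printf '{\n  "metadata": {\n    "publication_date": "2005"\n  }\n}\n' > "$pretty"

# Delete newline characters so that the object occupies a single line.
one_line=$(tr -d '\n' < "$pretty")
printf '%s\n' "$one_line"

# Confirm that the result is exactly one line long.
line_count=$(printf '%s\n' "$one_line" | wc -l | tr -d ' ')
echo "lines: $line_count"
rm -f "$pretty"
```

For files with several objects, `jq -c .` as used in the example usage remains the safer choice.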
Improvements
- improved support for streaming TaxoDros records in jsonl #275 fyi @myrmoteras @slint @lnielsen
Bugs
n/a
References
Bächli, G. (2024). TaxoDros - The Database on Taxonomy of Drosophilidae hash://md5/26a67012dde325cf2a3a058cc2f9c1b8 hash://sha256/ca86d74b318a334bddbc7c6a387a09530a083b8617718f5369ad548744c602d3 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10723540
0.8.1
Features
n/a
Improvements
- improved support for streaming TaxoDros records in jsonl #275 fyi @myrmoteras @slint @lnielsen
  - DOI extraction
  - publication type inference (e.g., book, article, collection)
  - parsing of publication volume, series, pages
  - DROS3 record support
Example records include:
{
"id": "aboim, 1945",
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "taxodros-dros3",
"keywords": [
"melanogaster 1",
"devel",
"egg",
"hist",
"fig"
],
"http://www.w3.org/ns/prov#wasDerivedFrom": "line:hash://sha256/efbba5753be41ce7a7fda25819e6c1e83ad1de6c195fba34faf279d3775605f3!/L31-L38"
}
and
{
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "taxodros-dros5",
"id": "aceituno et al., 2020",
"authors": "Aceituno-Medina, M., Ordonez, A., Carrasco, M., Montoya, P., & Hernandez, E.,",
"year": "2020",
"title": "Mass Rearing, Quality Parameters, and Bioconversion in Drosophila suzukii (Diptera: Drosophilidae) for Sterile Insect Technique Purposes.",
"type": "article",
"journal": "J. econ. Ent.",
"volume": "113",
"pages": "1097Ð1104",
"number": "3",
"doi": "10.1093/jee/toaa022",
"method": "ocr / doi:10.1093/jee/toaa022",
"http://www.w3.org/ns/prov#wasDerivedFrom": "line:hash://sha256/54c249d040b1414380b8a509004b04781ef3c62a12715b627cfa8401829eae65!/L147-L157",
"filename": "Aceituno et al., 2020.pdf"
}
- introduce -f/--file option for providing lists of filenames/URLs to be tracked #277
Example usage:
preston track --file <(echo https://example.org)
where `<(echo https://example.org)` produces a file with a single line containing https://example.org.
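The `<(...)` form is process substitution, which is available in bash, zsh and ksh but not in plain POSIX sh. As a sketch of an equivalent that works in any POSIX shell, the list can be written to a temporary file first. The temporary file and the variable names are my own choices, and the tracking step only runs when a preston executable is found on the PATH.

```shell
# Write one URL per line to a plain file.
list=$(mktemp)
printf '%s\n' 'https://example.org' > "$list"
cat "$list"

# Record the number of entries in the list.
lines=$(wc -l < "$list" | tr -d ' ')
echo "entries: $lines"

# Only attempt tracking when preston is installed.
if command -v preston >/dev/null 2>&1; then
  preston track --file "$list"
fi
rm -f "$list"
```

Keeping the list in a named file also makes it easy to review or version the set of tracked locations.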
Bugs
n/a
References
Bächli, G. (2024). TaxoDros - The Database on Taxonomy of Drosophilidae hash://md5/4fa9eeed1c8cff2490483a48c718df02 hash://sha256/e05466f33c755f11bd1c2fa30eef2388bf24ff7989931bae1426daff0200af19 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10593902
0.8.0
Features
- initial pass at support for streaming TaxoDros DROS5.TEXT records in jsonl #275 fyi @myrmoteras @slint @lnielsen
Example - select first DROS5.TEXT literature record with DOI from Bächli, G. (2024)
preston cat hash://md5/1037a9c831005710dc9bf14ee9a2e053 \
  --remote https://zenodo.org \
  --algo md5 \
  | preston taxodros-stream \
    --remote https://zenodo.org \
    --algo md5 \
  | grep DOI \
  | head -n1 \
  | jq .
produces:
{
"id": "abram et al., 2022",
"authors": "Abram, P.K., et al.,",
"year": "2022",
"title": "A Coordinated Sampling and Identification Methodology for Larval Parasitoids of Spotted-Wing Drosophila.",
"journal": "J. econ. Ent., 115(4):922Ð942.",
"doi": "10.1093/jee/toab237",
"method": "ocr++ / DOI:10.1093/jee/toab237",
"filename": "Abram et al., 2022.pdf",
"http://www.w3.org/ns/prov#wasDerivedFrom": "line:hash://md5/42be783197504a12172920a7edc7cbfd!/L120-L128",
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "taxodros-flatfile"
}
Improvements
- enable line selection in text files with Mac line endings #276. For example,
preston cat \
  'line:hash://md5/42be783197504a12172920a7edc7cbfd!/L120-L128' \
  --remote https://linker.bio,https://zenodo.org \
  | tr '\r' '\n'
produces:
.TEXT;
abram et al., 2022
.A Abram, P.K., et al.,
.J 2022
.S A Coordinated Sampling and Identification Methodology
for Larval Parasitoids of Spotted-Wing Drosophila.
.Z J. econ. Ent., 115(4):922�942.
.K ocr++ / DOI:10.1093/jee/toab237
.P Abram et al., 2022.pdf
Similar results can be obtained by requesting https://linker.bio/line:hash://md5/42be783197504a12172920a7edc7cbfd!/L120-L128 in a browser.
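To see why Mac line endings need special handling, consider this small sketch. It builds a carriage-return-delimited sample from the first lines of the record above, shows that newline-based tools see no line breaks in it, and then selects lines 2 to 3 after translating the line endings. This mimics the effect of a line selector with standard tools and is not preston's implementation.

```shell
# A CR-delimited sample built from the first lines of the record above.
sample=$(mktemp)
printf '.TEXT;\rabram et al., 2022\r.A Abram, P.K., et al.,\r.J 2022\r' > "$sample"

# With CR line endings, newline-based tools find no line breaks at all.
newline_count=$(wc -l < "$sample" | tr -d ' ')
echo "newline-delimited lines: $newline_count"

# Translate CR to LF first, then select lines 2 to 3.
selected=$(tr '\r' '\n' < "$sample" | sed -n '2,3p')
printf '%s\n' "$selected"
rm -f "$sample"
```

The same `tr '\r' '\n'` step appears in the example above, where it makes the selected record readable in a terminal.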
Bugs
n/a
References
Bächli, G. (2024). TaxoDros - The Database on Taxonomy of Drosophilidae hash://md5/d68c923002c43271cee07ba172c67b0b hash://sha256/3e41eec4c91598b8a2de96e1d1ed47d271a7560eb6ef350a17bc67cc61255302 [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10565403