Description
This is all a bit of a mess; there is definitely a better way to do it.
`write_citation_pairs` takes a data frame with one column for article id and one for dataset id. It loops through each row and uses `crossref::cr_cn` to retrieve a full citation for the paper from the article id. We need information such as authors and title to send to the metrics service.
`crossref::cr_cn` returns the citation in BibTeX format (it can optionally return JSON and other formats). That BibTeX is passed to `bib2df::bib2df`, which parses the text string into a data frame. Parsing this text string is somewhat of a nightmare, though, and I ended up refactoring `bib2df` to accommodate single-line BibTeX docs, which for some reason `crossref::cr_cn` started returning. So I did that here, but the method I had to use requires that you know what the fields of the BibTeX entry are. Occasionally a BibTeX entry will come back with a really oddball field in it, and that field name has to be passed to the `extra_fields` argument of `bib2df` and the function run again to get the correct parsing; otherwise the rest of the document is thrown off. This is all especially frustrating because we only need certain fields to pass to the metrics service, but the ENTIRE doc needs to be processed correctly.
So some options to make this require no human intervention:
- Capture the warning output from the first pass, parse it, feed the fields back in for a second pass
  - this seems ridiculous
- Have `crossref::cr_cn` just return the JSON, parse it, and extract what we need, bypassing `bib2df` entirely
- Find a more straightforward way to retrieve just the information we need, probably by querying the crossref API more directly
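
For the JSON route (the last two options), here is a rough sketch of what the field extraction could look like. It is written in Python only so that it is self-contained and runs without network access. The record shape is an assumption based on the CSL-JSON style records Crossref returns, and `extract_citation_fields`, the sample record, and the choice of output fields are all hypothetical. The fields the metrics service actually needs would have to be swapped in.

```python
def first(value):
    """Return the first item of a list, or the value itself if it is not a list."""
    if isinstance(value, list):
        return value[0] if value else None
    return value


def extract_citation_fields(record):
    """Pull a few citation fields out of a parsed CSL-JSON style record.

    Keys we do not ask for are simply ignored, so an oddball field cannot
    throw off the rest of the parse.
    """
    authors = []
    for person in record.get("author", []):
        parts = [person.get("given"), person.get("family")]
        full_name = " ".join(p for p in parts if p)
        # Assumption: organisational authors carry a "name" key instead.
        authors.append(full_name or person.get("name", ""))

    # Assumption: "issued" looks like {"date-parts": [[year, month, day]]}.
    date_parts = record.get("issued", {}).get("date-parts", [])
    year = date_parts[0][0] if date_parts and date_parts[0] else None

    return {
        "doi": record.get("DOI"),
        # Assumption: title may arrive as a string or as a one-element list.
        "title": first(record.get("title")),
        "journal": first(record.get("container-title")),
        "authors": authors,
        "year": year,
    }


# Made-up record for illustration; not real Crossref output.
sample = {
    "DOI": "10.1234/example",
    "title": ["An example article"],
    "container-title": "Journal of Examples",
    "author": [
        {"given": "Ada", "family": "Lovelace"},
        {"name": "Example Consortium"},
    ],
    "issued": {"date-parts": [[2019, 5]]},
    "some-oddball-field": "ignored",
}

fields = extract_citation_fields(sample)
print(fields)
```

The point of the sketch is that the lookup is by key, so nothing needs to be known up front about which other fields a record contains, and no second pass is required.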