dataseer-ml is a GROBID module that identifies implicit mentions of datasets in a scientific article. The identified datasets are then classified in a hierarchy of dataset types directly derived from MeSH. It is a back-end service used by the DataSeer-Web application. Most of the datasets discussed in scientific articles are not actually named, but these data are part of the disclosed scientific work and should be shared properly to meet the FAIR requirements.
The goal of this process is to guide the authors of the article towards the best research data sharing practices, i.e. to ensure that each dataset is associated with a data availability statement, permanent identifiers and, more generally, the requirements regarding Open Science and reproducibility. This further step is handled by the DataSeer web application, which includes a GUI to be used by the authors and suggests data sharing policies based on the predicted data types for each identified dataset.
The module can process a variety of scientific article formats: PDF, TEI, and mainstream publishers' native XML submission formats (JATS/NLM, ScholarOne, BMJ, Elsevier staging format, OUP, PNAS, RSC, Sage, Wiley, etc.). The .docx format is also supported in a GROBID-specific branch, but not yet merged.
The processing of an article follows 5 steps:
1. Given an article to be processed by DataSeer:
1.1. if the format is PDF or docx, the document is first parsed and structured automatically by GROBID. This includes metadata extraction and consolidation against CrossRef and PubMed, and structuring the text body and bibliographical references.
1.2. if the format is a publisher XML format (see Pub2TEI for the list of supported XML formats, e.g. TEI, JATS/NLM, ScholarOne, BMJ, Elsevier staging format, OUP, PNAS, RSC, Sage, Wiley, etc.), Pub2TEI converts the XML to the same customised structured TEI representation as GROBID.
2. The document body is then segmented into sentences with the Pragmatic Segmenter or OpenNLP, with some customization to better support scientific texts (i.e. avoiding wrong sentence breaks in the middle of reference callouts or scientific notations, and taking into account section and paragraph breaks as identified in the structure recognition step in GROBID).
3. Each sentence goes through a cascade of text classifiers, typically all based on a fine-tuned SciBERT deep learning architecture integrated in Java via DeLFT and JEP, to predict if the sentence introduces a dataset and, if so, which dataset type and subtype is introduced.
4. The text body is then processed by a sequence labeling model which aims at recognizing the sections relevant to dataset introductions and presentations (zoning of "data" sections). "Materials and Methods", for instance, is a usual relevant section, but other sections might be relevant, and "Materials and Methods" sections can appear under a variety of section headers and contain subsections that are not relevant. This sequence labeling is currently performed by a CRF using various features, including the predictions produced in step 3.
5. A final selection of the predicted datasets takes place for the sections identified as potentially introducing datasets, using the result of the sentence classification of step 3 to additionally predict the type and subtype of the recognized datasets.
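The way steps 3 to 5 combine can be pictured with the minimal Python sketch below. It is not the dataseer-ml implementation (which is written in Java and calls DeLFT models): the `Sentence` class, the `in_data_section` flag and the three classifier callables are hypothetical stand-ins for the zoning output and the SciBERT classifiers, and the sketch ignores the fact that the zoning model itself uses the sentence predictions as features.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Sentence:
    text: str
    in_data_section: bool  # hypothetical flag standing in for the zoning output


def select_datasets(
    sentences: List[Sentence],
    is_dataset: Callable[[str], bool],
    data_type: Callable[[str], Optional[str]],
    data_subtype: Callable[[str, str], Optional[str]],
) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return (sentence, type, subtype) for the sentences kept as dataset mentions."""
    selected = []
    for sentence in sentences:
        # Final selection: keep only sections flagged as introducing datasets.
        if not sentence.in_data_section:
            continue
        # Binary classifier: does the sentence introduce a dataset?
        if not is_dataset(sentence.text):
            continue
        # Cascade: first-level type, then a subtype conditioned on that type.
        dtype = data_type(sentence.text)
        subtype = data_subtype(sentence.text, dtype) if dtype else None
        selected.append((sentence.text, dtype, subtype))
    return selected
```

Passing the classifiers as plain callables keeps the sketch independent of any model library; any function with the same signature can be plugged in for experimentation.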
The result of the service is a TEI file representing the article, enriched with sentence boundaries and predicted data set information.
Above, the Fluorometry dataset class word cloud.
The DataSeer dataset covers:
- all the dataset contexts from 2000 Open Access articles from PLOS, classified into the taxonomy of data types developed at the Dataseer ResearchDataWiki. It contains 13,777 manually classified/annotated sentences about datasets (on average 6.89 dataset contexts per article).
- all the dataset contexts from 1000 very recent Open Access articles from PMC, with similar classification into the taxonomy. It contains 11,826 additional manually classified/annotated sentences about datasets.
After alignment with the actual content of the full article bodies (via Grobid) for the first set, the data is used for training the recognition of sections introducing datasets (so called "zoning" task implemented with CRF using the Wapiti library), a binary classifier (sentence mentioning a dataset or not) and the data type and subtype classifiers (SciBERT).
The total number of annotated sentences presenting a dataset is 25,603. The rest of the sentences of the 3,000 annotated articles can be used as negative examples via sampling techniques.
An additional "data reuse" model can also be trained to predict if an identified dataset is newly introduced or reused in the context of the article, using around 400 positive "reuse" examples (in the life science domain, our annotated data indicates that only 3.6% of all mentioned datasets are reused).
A Docker image for the dataseer-ml service can be used/built with the project Dockerfile. This is the simplest and preferred way to run the dataseer-ml service. The GPUs on your system will be automatically recognized and used, with fallback to CPU if no GPU is available. Note that the system works acceptably with CPU only (10-30 seconds per article), but the runtime is considerably better with a GPU.
For offline processing (e.g. non interactive usage scenario), it is recommended to exploit parallelization as much as possible, because the service takes advantage of multi-threading with CPU-only or GPU. Concurrent processing of PDF, XML or text input should be adjusted to the server capabilities.
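As an illustration of client-side parallelization, here is a minimal sketch based on Python's standard `concurrent.futures`. The `process_one` callable is a placeholder for whatever function sends one document to the service, and the default of 4 workers is an arbitrary assumption to be tuned to the server capabilities.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def process_many(
    paths: Iterable[str],
    process_one: Callable[[str], str],
    max_workers: int = 4,  # assumption: adjust to the server capabilities
) -> Dict[str, str]:
    """Apply process_one to every path concurrently; failures are recorded, not raised."""
    paths = list(paths)

    def safe(path: str) -> str:
        try:
            return process_one(path)
        except Exception as exc:  # keep the batch going if one document fails
            return f"ERROR: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe, paths))
    return dict(zip(paths, results))
```

Threads are sufficient here because each worker mostly waits on the HTTP response; the heavy computation happens on the server side.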
The dataseer-ml service is available as a Docker image on Docker Hub. Pull the image (here the latest available version, change the version number as needed) as follows:
docker pull grobid/dataseer:0.7.3
After pulling or building the Docker image, you can run the dataseer-ml service as a container as follows:
docker run --rm --gpus all -it -p 8060:8060 --init grobid/dataseer:0.7.3
A JavaScript demo/console web app is then accessible at http://localhost:8060. You can change the port mapping for the service at launch, for instance to port 8080, as follows:
docker run --rm --gpus all -it -p 8080:8060 --init grobid/dataseer:0.7.3
The complete build process is as follows:
- copy the Dockerfile.dataseer at the root of the GROBID installation:
~/grobid/dataseer-ml$ cp ./Dockerfile.dataseer ..
- from the GROBID root installation (grobid/), launch the docker build:
docker build -t grobid/dataseer:0.7.3 --build-arg GROBID_VERSION=0.7.2-SNAPSHOT --file Dockerfile.dataseer .
The Docker image build takes several minutes: it installs GROBID, dataseer-ml, a complete Python deep learning environment based on DeLFT, and pre-trained embeddings that are downloaded from the internet and pre-compiled. The resulting image is very large, more than 16GB, in particular due to the contained embeddings, models and kilometers of Python libraries.
- you can now run the dataseer-ml service via Docker:
docker run --rm --gpus all -it -p 8060:8060 --init grobid/dataseer:0.7.3
The built image includes automatic GPU support when a GPU is available on the host machine, via the parameter --gpus all (with automatic recognition of the CUDA version), with fallback to CPU if no GPU is available. GPU support is only available on Linux host machines.
The dataseer-ml service is available at the default host/port localhost:8060, but it is possible to map the port at launch time of the container, e.g. for port 8080:
docker run --rm --gpus all -it -p 8080:8060 --init grobid/dataseer:0.7.3
To build dataseer-ml locally, first install GROBID:
git clone https://github.com/kermitt2/grobid
Then install dataseer-ml and move it into GROBID as a sub-module:
git clone https://github.com/dataseer/dataseer-ml
mv dataseer-ml grobid/
Install DeLFT:
git clone https://github.com/kermitt2/delft
Follow the installation described in the DeLFT documentation. If necessary, update the path to the DeLFT installation in the grobid.yaml file located under grobid-home/config/.
By default, the project can process scientific articles in PDF and TEI formats. To process JATS/NLM, ScholarOne and a variety of other native publisher formats, Pub2TEI needs to be installed:
git clone https://github.com/kermitt2/Pub2TEI
If required, update the path to the Pub2TEI installation in the dataseer-ml.yaml file located under resources/config/:
# path to Pub2TEI repository as available at https://github.com/kermitt2/Pub2TEI
pub2teiPath: "../../Pub2TEI/"
Finally, copy the models under grobid-home and build dataseer-ml:
cd grobid/dataseer-ml
./gradlew copyModels
./gradlew clean install
./gradlew appRun
A JavaScript demo/console web app is then accessible at http://localhost:8060. From the console and the RESTful services tab, you can process a chunk of text (select Process text Sentence), process a complete PDF document (select Process PDF), process a TEI document (select Process TEI) or process an XML publisher native document (such as JATS, select Process JATS/NLM/...).
Upload a PDF document, extract its content and convert it into structured TEI (via GROBID), identify dataset introductory section, segment into sentences, identify sentence introducing a dataset and classify the dataset type. Return a TEI representation of the PDF, enriched with Dataseer information.
Example:
curl --form input=@./resources/samples/journal.pone.0198050.pdf localhost:8060/service/processDataseerPDF
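The same upload can be done from Python. The sketch below uses only the standard library and builds the multipart/form-data body by hand; the field name `input` and the endpoint come from the curl example above, while the function names, the `application/octet-stream` part type and the assumption that the response is UTF-8 text are mine.

```python
import urllib.request
import uuid
from pathlib import Path
from typing import Tuple


def build_multipart(field: str, filename: str, content: bytes) -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def process_pdf(pdf_path: str, base_url: str = "http://localhost:8060") -> str:
    """POST a PDF to processDataseerPDF and return the response body as text."""
    path = Path(pdf_path)
    body, content_type = build_multipart("input", path.name, path.read_bytes())
    request = urllib.request.Request(
        f"{base_url}/service/processDataseerPDF",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # Assumption: the enriched TEI is returned as UTF-8 text.
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a running service, `process_pdf("./resources/samples/journal.pone.0198050.pdf")` would be the counterpart of the curl call.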
Upload a TEI document, identify dataset introductory section, segment into sentences, identify sentence introducing a dataset and classify the dataset type. Return the TEI document enriched with Dataseer information. It is assumed that the input TEI document follows the Grobid customization (see here).
Example:
curl --form input=@./resources/samples/journal.pone.0198050.tei.xml localhost:8060/service/processDataseerTEI
Upload a document in a publisher native XML format, convert it into structured TEI (via Pub2TEI), identify the dataset introductory sections, segment into sentences, identify the sentences introducing a dataset and classify the dataset type. Return a TEI representation of the document, enriched with Dataseer information.
Example:
curl --form input=@./resources/samples/journal.pone.0198050.xml localhost:8060/service/processDataseerJATS
See Pub2TEI for the exact list of supported formats.
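A client handling mixed inputs needs to pick one of the three document-level endpoints. The helper below is a small sketch of such a dispatch; the endpoint paths are those of the examples above, but selecting them from the file extension (in particular treating `.tei.xml` as Grobid-style TEI and any other `.xml` as a publisher format) is only a naming convention assumed for illustration.

```python
def endpoint_for(filename: str) -> str:
    """Pick a service path from the file name (the naming convention is an assumption)."""
    name = filename.lower()
    if name.endswith(".pdf"):
        return "/service/processDataseerPDF"
    # Check the more specific TEI suffix before the generic .xml one.
    if name.endswith(".tei.xml") or name.endswith(".tei"):
        return "/service/processDataseerTEI"
    if name.endswith(".xml"):
        return "/service/processDataseerJATS"
    raise ValueError(f"unsupported file type: {filename}")
```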
Identify if the sentence introduces a dataset, if yes classify the dataset type. This service is offered for test and demonstration purposes. Use the document-level service for processing an article for a complete and realistic usage.
Example:
curl -X POST -d "text=This is a sentence." http://localhost:8060/service/processDataseerSentence
curl -G --data-urlencode "text=This is another sentence." http://localhost:8060/service/processDataseerSentence
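For reference, a Python counterpart of the POST call using only the standard library could look like the sketch below. The function names are hypothetical, and the response is returned as raw text because its exact format is not described here.

```python
import urllib.parse
import urllib.request


def build_sentence_request(
    text: str, base_url: str = "http://localhost:8060"
) -> urllib.request.Request:
    """Build a form-encoded POST request for the sentence-level service."""
    data = urllib.parse.urlencode({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/service/processDataseerSentence", data=data, method="POST"
    )


def classify_sentence(text: str, base_url: str = "http://localhost:8060") -> str:
    """Send one sentence to the service and return the raw response body."""
    with urllib.request.urlopen(build_sentence_request(text, base_url)) as response:
        return response.read().decode("utf-8")
```

Separating the request construction from the network call makes the encoding easy to inspect without a running service.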
The DataSeer client can access the JSON file specifying the data types supported by the classifiers, together with metadata for each data type (description, best data sharing policy, link to the corresponding DataSeer Wiki page, etc.) with the following endpoint:
curl localhost:8060/service/jsonDataTypes
This service triggers a web crawling of the DataSeer Wiki pages describing the supported data types. Metadata about each type are extracted (description, best data sharing policy, link to the corresponding DataSeer Wiki page, etc.) and a json datatype resource file is assembled and served to the client:
curl localhost:8060/service/resyncJsonDataTypes
From this source, training data is available in a tabular format with reference to Open Access articles. The following process will align these tabular data (introduced by parameter -Pcsv) with the actual article content (JATS/NLM and PDF via GROBID) to create a full training corpus.
./gradlew annotated_corpus_generator_csv -Ppdf=PATH/TO/THE/FULL/TEXTS/PDF -Pfull=PATH/TO/THE/FULL/TEXTS/NLM/ -Pcsv=PATH/TO/THE/TABULAR/TRAINING/DATA -Pxml=PATH/WHERE/TO/WRITE/THE/ASSEMBLED/TRAINING/DATA
For instance:
./gradlew annotated_corpus_generator_csv -Ppdf=/mnt/data/resources/plos/pdf/ -Pfull=/mnt/data/resources/plos/nlm/ -Pcsv=/home/lopez/grobid/dataseer-ml/resources/dataset/dataseer/csv/ -Pxml=/home/lopez/grobid/dataseer-ml/resources/dataset/dataseer/corpus/
Some reports will be generated to describe the alignment failures.
The classifier models rely on the DeLFT deep learning library, which is integrated in Grobid.
After assembling the training data, the classification models can be trained with the following command under the DeLFT project (current version 0.3.1 of DeLFT):
cd delft
python3 delft/applications/dataseerClassifier.py train --architecture bert --transformer allenai/scibert_scivocab_cased
Possible architectures are documented in the DeLFT project.
For producing an evaluation (including n-fold cross evaluation), see the DeLFT documentation.
[To Be Completed]
This model is a sequence labeling model working at segment level (i.e. a sequence of segments with one label per segment).
Train with all available training data (the default grobid-home path is grobid/grobid-home, so there is usually no need to indicate this parameter):
./gradlew train_dataseer -PgH=/path/grobid/home
Evaluation with a random split of the annotated data with a ratio of 0.9 (90% training, 10% evaluation):
./gradlew eval_dataseer_split -PgH=/path/grobid/home -Ps=0.9
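To make the meaning of the 0.9 ratio concrete, here is a minimal Python sketch of a random split. It only illustrates the idea; it is not the code run by the Gradle task, and the function name and the fixed seed are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(
    items: Sequence[T], ratio: float = 0.9, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the items and split them into (training, evaluation) parts."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = round(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]
```

With 100 annotated documents and a ratio of 0.9, this yields 90 documents for training and 10 for evaluation.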
10-fold cross-evaluation:
./gradlew eval_dataseer_nfold -PgH=/path/grobid/home -Pt=10
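The principle of n-fold cross-evaluation is sketched below in Python: the data is partitioned into n folds, and each fold serves once as the evaluation set while the others are used for training. This is a generic illustration with a hypothetical function name, not the code behind the Gradle task; in practice the items would usually be shuffled first.

```python
from typing import Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def n_fold(items: Sequence[T], n: int = 10) -> Iterator[Tuple[List[T], List[T]]]:
    """Yield n (training, evaluation) pairs; each item is evaluated exactly once."""
    if n < 2 or n > len(items):
        raise ValueError("n must be between 2 and the number of items")
    for k in range(n):
        # Item i belongs to fold i % n, so fold sizes differ by at most one.
        evaluation = [x for i, x in enumerate(items) if i % n == k]
        training = [x for i, x in enumerate(items) if i % n != k]
        yield training, evaluation
```

The reported score is then typically the average of the n evaluation results.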
The dataset annotations performed with the DataSeer web application are stored directly in a TEI format. The TEI document thus contains both the manually corrected dataset annotations and the exact contexts in which the datasets are mentioned in the structured document. We can therefore add this data to the existing training data and retrain the models - the DataSeer web application being actually also a PDF annotation tool for creating new training data.
To generate training data from the application, first indicate the connection information to access the DataSeer web API (file config.json). A token corresponding to the curator level user right is necessary (it can be generated from the DataSeer web application, in the account panel). Then indicate in the config file the user names of the annotators/curators whose annotated, validated documents you wish to retrieve. The latest versions of the datasets of the documents modified by the listed annotators/curators will be used as extracted training data.
By default the script outputs the data validated by the indicated annotators or curators (who provide an expert validation of the manual annotations). If relevant, you can modify the script to apply other selection criteria. Then use the script as follows:
> python3 app_document_converter.py --config my_config.json --output ~/tmp/
The command will produce 3 files in the CSV training data format:
- binary.csv for the binary classifier (i.e. dataset/no_dataset) with negative sampling (the negative sampling rate can be adjusted with the variable MAX_NEGATIVE_EXAMPLES_FROM_SAME_DOCUMENT)
- reuse.csv for the binary classifier (reuse/no_reuse), if reuse information is available
- multilevel.csv gives the data type and data subtype for data sentences
In addition, a CSV file called extract_summary.csv, containing all the previous fields and some complementary ones, will also be generated. It is not meant for training but for human review (it includes additional information not used for training the machine learning models, such as dataset names, dataset permanent identifiers, etc.).
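The idea of capping the negative examples per document can be sketched as follows. This is not the code of app_document_converter.py: the input structure, the function name and the value 10 are assumptions chosen for illustration; only the variable name MAX_NEGATIVE_EXAMPLES_FROM_SAME_DOCUMENT and the dataset/no_dataset labels come from the description above.

```python
import random
from typing import Dict, List, Tuple

# Illustrative value only, not necessarily the default used by the script.
MAX_NEGATIVE_EXAMPLES_FROM_SAME_DOCUMENT = 10


def sample_binary_examples(
    documents: Dict[str, List[Tuple[str, bool]]],
    max_negatives: int = MAX_NEGATIVE_EXAMPLES_FROM_SAME_DOCUMENT,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Keep all dataset sentences and at most max_negatives other sentences per document."""
    rng = random.Random(seed)
    rows: List[Tuple[str, str]] = []
    for doc_id in sorted(documents):
        sentences = documents[doc_id]
        positives = [text for text, is_dataset in sentences if is_dataset]
        negatives = [text for text, is_dataset in sentences if not is_dataset]
        # Random subset of the non-dataset sentences of this document.
        kept = rng.sample(negatives, min(max_negatives, len(negatives)))
        rows += [(text, "dataset") for text in positives]
        rows += [(text, "no_dataset") for text in kept]
    return rows
```

Raising the cap increases the share of negative examples in the binary training set, which is how the positive/negative balance can be tuned.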
Finally, the corresponding TEI files will be exported and written in a subdirectory corpus/ under the directory specified by the --output parameter. These TEI files can then be used as such to retrain the dataset-relevant section identifier model (see previous section; these new TEI files need to be copied under dataseer-ml/resources/dataset/dataseer/corpus/).
This process enables in practice a continuous re-training of the 4 different ML models based on the decisions/corrections of the end-users of the application.
Here are some benchmarks on the dataset recognition and data type classification tasks. Given the current sparsity of the training examples for some data types, only a subset of major data types can be predicted as of mid-2020.
The evaluated classification models are:
- BiGRU is a robust deep learning text classifier using two bidirectional GRU,
- bert-base-cased is a fine-tuned BERT base model as made available by Google Research (BERT-Base, Cased, 12-layer, 768-hidden, 12-heads, 110M parameters), see here, and
- SciBERT (cased) is a BERT architecture trained on scientific literature by AI2, see here.
SciBERT provides almost always the best classification accuracy.
Reported scores are obtained with 10-fold cross-validation.
Tasks and evaluations:
- binary classifier task: predict if the sentence introduces or not a dataset.
Trained initially with 21,042 examples (approx. 55% positive, 45% negative).
Initial model comparison:
BiGRU
-----
precision recall f-score support
dataset 0.8619 0.9465 0.9022 1121
no_dataset 0.9198 0.8019 0.8568 858
bert-base-en
------------
precision recall f-score support
dataset 0.8466 0.9795 0.9082 1121
no_dataset 0.9663 0.7681 0.8558 858
SciBERT
-------
precision recall f-score support
dataset 0.9053 0.9233 0.9142 1108
no_dataset 0.8975 0.8743 0.8857 851
Balancing more realistically positive and negative in the training and evaluation set (approx. 30% positive, 70% negative):
SciBERT
-------
Evaluation on 3574 instances:
precision recall f-score support
dataset 0.8844 0.9428 0.9127 1136
no_dataset 0.9725 0.9426 0.9573 2438
Results (10-2020) after extending the training data to around 59,400 examples (approx. 30% positive, 70% negative):
SciBERT
-------
Evaluation on 5993 instances:
precision recall f-score support
dataset 0.9166 0.9664 0.9408 2320
no_dataset 0.9780 0.9445 0.9609 3673
Results (04-2022) using DeLFT 0.3.1 updated architecture based on TensorFlow 2.7, around 59,400 examples (approx. 30% positive, 70% negative):
SciBERT
-------
Evaluation on 5993 instances:
precision recall f-score support
dataset 0.9339 0.9560 0.9448 2320
no_dataset 0.9718 0.9573 0.9645 3673
Results (08-2023) using DeLFT 0.3.3, around 112,200 examples (approx. 30% positive, 70% negative):
SciBERT
-------
Evaluation on 11307 instances:
precision recall f-score support
dataset 0.9173 0.9428 0.9299 3881
no_dataset 0.9697 0.9556 0.9626 7426
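As a reading aid for these reports, the f-score column is the harmonic mean of precision and recall. The small Python check below recomputes it for the "dataset" row of the 08-2023 report; since the reports are normally computed from raw counts rather than from rounded values, recomputing from the printed figures can differ in the last digit for other rows.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (F1); 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# 'dataset' row of the 08-2023 SciBERT report: precision 0.9173, recall 0.9428
print(round(f_score(0.9173, 0.9428), 4))  # 0.9299
```

The zero case explains the rows of 0.0000 in the data type reports further down: a class that is never predicted correctly has neither precision nor recall.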
- first-level taxonomy classification: given a sentence, we evaluate if it introduces a high-level data type or no dataset. The first-level dataset taxonomy contains a total of 29 data types which correspond to MeSH classes, see the Dataseer ResearchDataWiki. In the following evaluation reports, classes with zero predictions are kept for information. No prediction happens when there are too few examples in the training data for a data type, which is the case for around 2/3 of the data types. Best results are obtained with SciBERT, see the lower part. The model comparison is based on training data from the first set of 2000 articles:
BiGRU
-----
precision recall f-score support
Angiography 0.0000 0.0000 0.0000 1
Calorimetry 0.0000 0.0000 0.0000 1
Chromatography 0.6667 0.5455 0.6000 11
Coulombimetry 0.0000 0.0000 0.0000 0
Dataset 0.4828 0.6222 0.5437 45
Densitometry 0.0000 0.0000 0.0000 0
Digital Drople 0.0000 0.0000 0.0000 0
Electrocardiog 0.0000 0.0000 0.0000 3
Electroencepha 1.0000 1.0000 1.0000 2
Electromyograp 0.0000 0.0000 0.0000 3
Electrooculogr 0.0000 0.0000 0.0000 1
Electrophysiol 0.0000 0.0000 0.0000 0
Electroretinog 0.0000 0.0000 0.0000 0
Emission flame 0.0000 0.0000 0.0000 1
Flow cytometry 0.9444 0.8095 0.8718 21
Genetic Data 0.7879 0.6341 0.7027 41
Image 0.7875 0.8289 0.8077 152
Mass Spectrome 0.0000 0.0000 0.0000 4
Protein Data 0.0000 0.0000 0.0000 1
Real-Time Poly 0.8286 0.8788 0.8529 33
Sound data 0.0000 0.0000 0.0000 1
Spectrometry 0.7308 0.7917 0.7600 48
Spectrum Analy 0.0000 0.0000 0.0000 0
Spirometry dat 0.0000 0.0000 0.0000 0
Tabular data 0.8156 0.8048 0.8102 753
Video Recordin 0.0000 0.0000 0.0000 2
Voltammetry da 0.0000 0.0000 0.0000 1
X-Ray Diffract 0.0000 0.0000 0.0000 7
X-Ray fluoresc 0.0000 0.0000 0.0000 0
no_dataset 0.8459 0.8663 0.8560 830
bert-base-en
------------
precision recall f-score support
Angiography 0.0000 0.0000 0.0000 1
Calorimetry 0.0000 0.0000 0.0000 1
Chromatography 0.0000 0.0000 0.0000 11
Coulombimetry 0.0000 0.0000 0.0000 0
Dataset 0.7368 0.3111 0.4375 45
Densitometry 0.0000 0.0000 0.0000 0
Digital Drople 0.0000 0.0000 0.0000 0
Electrocardiog 0.0000 0.0000 0.0000 3
Electroencepha 0.0000 0.0000 0.0000 2
Electromyograp 0.0000 0.0000 0.0000 3
Electrooculogr 0.0000 0.0000 0.0000 1
Electrophysiol 0.0000 0.0000 0.0000 0
Electroretinog 0.0000 0.0000 0.0000 0
Emission flame 0.0000 0.0000 0.0000 1
Flow cytometry 0.9091 0.4762 0.6250 21
Genetic Data 0.5490 0.6829 0.6087 41
Image 0.7204 0.8816 0.7929 152
Mass Spectrome 0.0000 0.0000 0.0000 4
Protein Data 0.0000 0.0000 0.0000 1
Real-Time Poly 0.6667 0.9697 0.7901 33
Sound data 0.0000 0.0000 0.0000 1
Spectrometry 0.7049 0.8958 0.7890 48
Spectrum Analy 0.0000 0.0000 0.0000 0
Spirometry dat 0.0000 0.0000 0.0000 0
Tabular data 0.7670 0.8964 0.8267 753
Video Recordin 0.0000 0.0000 0.0000 2
Voltammetry da 0.0000 0.0000 0.0000 1
X-Ray Diffract 0.0000 0.0000 0.0000 7
X-Ray fluoresc 0.0000 0.0000 0.0000 0
no_dataset 0.9391 0.7988 0.8633 830
SciBERT
-------
precision recall f-score support
Calorimetry 0.0000 0.0000 0.0000 2
Chromatography 0.6000 1.0000 0.7500 6
Coulombimetry 0.0000 0.0000 0.0000 0
Densitometry 0.0000 0.0000 0.0000 0
Electrocardiog 0.5000 0.5000 0.5000 4
Electroencepha 0.0000 0.0000 0.0000 1
Electromyograp 0.0000 0.0000 0.0000 1
Electrooculogr 0.0000 0.0000 0.0000 0
Electrophysiol 0.0000 0.0000 0.0000 0
Electroretinog 0.0000 0.0000 0.0000 1
Emission Flame 0.0000 0.0000 0.0000 0
Flow Cytometry 0.9375 0.9375 0.9375 16
Genetic Data 0.6535 0.8354 0.7333 79
Image 0.7433 0.8968 0.8129 155
Mass Spectrome 0.9048 0.9048 0.9048 21
Protein Data 0.0000 0.0000 0.0000 3
Sound Data 0.0000 0.0000 0.0000 2
Spectrometry 0.7021 0.8919 0.7857 37
Spectrum Analy 0.0000 0.0000 0.0000 0
Systematic Rev 0.0000 0.0000 0.0000 2
Tabular Data 0.8524 0.8620 0.8571 797
Video Recordin 0.0000 0.0000 0.0000 5
Voltammetry Da 0.0000 0.0000 0.0000 0
X-Ray Diffract 0.0000 0.0000 0.0000 4
no_dataset 0.9685 0.9463 0.9573 2438
Extending the training data to 3000 articles (10-2020):
SciBERT
-------
Total 47669 instances
precision recall f-score support
calorimetry 0.0000 0.0000 0.0000 1
chromatography 0.7532 0.8056 0.7785 72
coulombimetry 0.0000 0.0000 0.0000 0
densitometry 0.0000 0.0000 0.0000 0
electrocardiog 0.0000 0.0000 0.0000 5
electroencepha 0.8000 0.8000 0.8000 5
electromyograp 0.0000 0.0000 0.0000 0
electrooculogr 0.0000 0.0000 0.0000 0
electrophysiol 0.0000 0.0000 0.0000 1
electroretinog 0.0000 0.0000 0.0000 1
emission flame 0.0000 0.0000 0.0000 0
flow cytometry 0.8971 0.8841 0.8905 69
genetic data 0.8259 0.9022 0.8623 184
image 0.8041 0.9105 0.8540 257
mass spectrome 0.6667 0.6562 0.6614 64
protein data 0.0000 0.0000 0.0000 0
sound data 1.0000 0.2500 0.4000 4
spectrometry 0.7544 0.8866 0.8152 97
spectrum analy 0.0000 0.0000 0.0000 0
systematic rev 0.0000 0.0000 0.0000 1
tabular data 0.8772 0.9087 0.8927 1588
video recordin 0.0000 0.0000 0.0000 3
voltammetry da 0.0000 0.0000 0.0000 1
x-ray diffract 0.0000 0.0000 0.0000 8
Results 08-2023:
SciBERT
-------
Total 63,912 instances
Evaluation on 6113 instances:
precision recall f-score support
calorimetry 0.6667 0.6667 0.6667 3
chromatography 0.8731 0.9141 0.8931 128
code software 0.6629 0.8227 0.7342 141
coulombimetry 0.0000 0.0000 0.0000 0
dataset re-use 0.0000 0.0000 0.0000 2
densitometry 0.0000 0.0000 0.0000 1
electrocardiog 0.7500 1.0000 0.8571 3
electroencepha 0.5000 1.0000 0.6667 2
electromyograp 0.5000 1.0000 0.6667 1
electrooculogr 0.0000 0.0000 0.0000 0
electrophysiol 0.0000 0.0000 0.0000 2
electroretinog 0.0000 0.0000 0.0000 0
emission flame 0.0000 0.0000 0.0000 0
flow cytometry 0.8471 0.7273 0.7826 99
genetic data 0.8529 0.8315 0.8421 279
image 0.7988 0.8133 0.8060 332
lab materials 0.7949 0.8757 0.8333 177
mass spectrome 0.6863 0.7071 0.6965 99
no_dataset 0.9519 0.9241 0.9378 2357
other 0.0000 0.0000 0.0000 0
protein data 0.5882 0.8333 0.6897 12
protocol 0.5294 0.3000 0.3830 30
sound data 0.8000 0.6667 0.7273 6
spectrometry 0.8462 0.8209 0.8333 134
spectrum analy 0.0000 0.0000 0.0000 2
systematic rev 0.0000 0.0000 0.0000 5
tabular data 0.8911 0.9063 0.8986 2274
video recordin 0.5714 0.4444 0.5000 9
voltammetry da 0.0000 0.0000 0.0000 1
x-ray diffract 0.7368 1.0000 0.8485 14
- second-level taxonomy: for the first-level data types that can be predicted by the first-level classifier, we build a series of additional classifiers to predict a second-level data type, assuming a cascaded approach. See the Dataseer ResearchDataWiki for more details about the data types. Best results are obtained with SciBERT too, see the lower part.
BiGRU
-----
Evaluation Chromatography subtypes
Evaluation on 12 instances:
precision recall f-score support
High Pressure Liq. 0.5833 1.0000 0.7368 7
Evaluation Genetic Data subtypes
Evaluation on 46 instances:
precision recall f-score support
High-Throughpu 0.7500 0.9000 0.8182 10
Sequence Analy 0.4118 1.0000 0.5833 14
Evaluation Image subtypes
Evaluation on 145 instances:
precision recall f-score support
Electrophoresi 0.7500 0.9600 0.8421 25
Microscopy 0.9577 0.9189 0.9379 74
Magnetic Reson 0.3158 0.7500 0.4444 8
nan 0.5714 0.6667 0.6154 18
Evaluation Mass Spectrometry subtypes
Evaluation on 5 instances:
precision recall f-score support
Gas Chromatogr 0.6000 1.0000 0.7500 3
Evaluation Spectrometry subtypes
Evaluation on 51 instances:
precision recall f-score support
Spectrophotome 0.5490 1.0000 0.7089 28
Evaluation Tabular data subtypes
Evaluation on 762 instances:
precision recall f-score support
nan 0.8148 0.8894 0.8505 470
Assay 0.7288 0.6719 0.6992 64
Fluorometry 0.9000 0.6000 0.7200 15
Sample Table 0.3889 0.1944 0.2593 36
Subject Data T 0.7687 0.6975 0.7314 162
bert-base-en
------------
Evaluation Chromatography subtypes
precision recall f-score support
High Pressure Liq. 0.5455 1.0000 0.7059 6
Evaluation Dataset subtypes
precision recall f-score support
Existing datas 0.9070 1.0000 0.9512 39
Evaluation Genetic Data subtypes
precision recall f-score support
High-Throughpu 0.4091 1.0000 0.5806 9
Sequence Analy 0.4375 0.4375 0.4375 16
Evaluation Image subtypes
precision recall f-score support
Computerized T 0.3500 0.7778 0.4828 9
Electrophoresi 0.9062 0.9667 0.9355 30
Microscopy 0.8158 1.0000 0.8986 62
Magnetic Reson 1.0000 0.0833 0.1538 12
nan 0.6667 0.2353 0.3478 17
Evaluation Mass Spectrometry subtypes
precision recall f-score support
Gas Chromatogr 0.5000 0.6667 0.5714 3
Evaluation Spectrometry subtypes
precision recall f-score support
Spectrophotome 0.6667 1.0000 0.8000 22
Evaluation Tabular data subtypes
precision recall f-score support
nan 0.8065 0.8789 0.8412 479
Assay 0.6711 0.6711 0.6711 76
Fluorometry 0.6129 0.8636 0.7170 22
Sample Table 0.5333 0.2222 0.3137 36
Subject Data T 0.7628 0.7212 0.7414 165
SciBERT
-------
Evaluation Chromatography subtypes
precision recall f-score support
High Pressure 0.6364 1.0000 0.7778 7
Evaluation Genetic Data subtypes
precision recall f-score support
Real-Time Poly 0.8333 1.0000 0.9091 30
High-Throughpu 0.9286 0.8667 0.8966 15
Sequence Analy 0.8235 0.7778 0.8000 18
Evaluation Image subtypes
precision recall f-score support
X-Ray Computed 0.8182 0.6923 0.7500 13
Electrophoresi 0.9286 1.0000 0.9630 26
Microscopy 0.9200 0.9718 0.9452 71
Evaluation Mass Spectrometry subtypes
precision recall f-score support
Liquid Chromat 0.5455 1.0000 0.7059 6
Evaluation Spectrometry subtypes
precision recall f-score support
Spectrophotome 0.5385 1.0000 0.7000 21
Evaluation Tabular data subtypes
precision recall f-score support
nan 0.8963 0.7840 0.8364 551
Assay 0.6477 0.7808 0.7081 73
Fluorometry 0.8125 0.8667 0.8387 15
Sample Table 0.4211 0.4444 0.4324 36
Subject Data T 0.6054 0.8766 0.7162 154
- "Data reuse" model: this model tries to predict if the mentioned dataset is newly introduced by the research study or is the reuse of an existing dataset:
Results 04-2022 using DeLFT 0.3.1 updated architecture based on TensorFlow 2.7, 11,500 annotated sentences:
Evaluation on 1122 instances:
precision recall f-score support
no_reuse 0.9907 0.9871 0.9889 1083
reuse 0.6744 0.7436 0.7073 39
Results 08-2023 using DeLFT 0.3.3, 27,100 annotated sentences:
Evaluation on 2713 instances:
precision recall f-score support
no_reuse 0.9307 0.9473 0.9389 2183
reuse 0.7658 0.7094 0.7365 530
This is only relevant for examining the manually labeled training data, and optionally correcting it via the existing web application (cool stuff: documents with datasets entered via the web application have the same format as the training data, so documents annotated by users in dataseer-web can be used for training by dataseer-ml). All the documents present in the local training data repository (after importing the training data, see above) under dataseer-ml/resources/dataset/dataseer/corpus/ will be loaded via the DataSeer web API.
cd scripts/
node loader.js
The following Python script converts the data type specification from the DataSeer Doku Wiki (http://wiki.dataseer.io) into a JSON representation used by the DataSeer Web application. In addition, it will use the training file(s) to inject counts for each data type. This frequency information is used by the DataSeer Web application to provide a default ranking of data types in the drop-down menus when a data type is assigned manually.
cd scripts/
pip3 install pandas beautifulsoup4
python3 converter.py ../resources/DataTypes.csv ../resources/DataTypes.json
Note that at the end of the DataSeer Doku Wiki conversion, the script will report data types in the training data inconsistent with the DataSeer Doku Wiki. Those data types must be reviewed and updated to be consistent with Wiki, and the machine learning models must be retrained with the updated training data to produce the new data types.
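The two outputs described above (frequency counts for ranking, and the report of inconsistent data types) can be illustrated with the short sketch below. It is not the code of converter.py: the function name and the plain list/set inputs are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def rank_and_check(
    training_types: Iterable[str], wiki_types: Set[str]
) -> Tuple[List[Tuple[str, int]], List[str]]:
    """Return (types ranked by frequency, types missing from the Wiki specification)."""
    counts = Counter(training_types)
    ranking = counts.most_common()  # most frequent data type first
    unknown = sorted(t for t in counts if t not in wiki_types)
    return ranking, unknown
```

For example, with the labels "Tabular data", "Image", "Tabular data" and a misspelled "Imge", the ranking starts with ("Tabular data", 2) and "Imge" is reported as inconsistent with the Wiki.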
Author and contact: Patrice Lopez ([email protected])
The development of dataseer-ml was supported by a Sloan Foundation grant, see here
dataseer-ml is distributed under Apache2 license.