Simplifying Scholarly Abstracts for Accessible Digital Libraries

This repository accompanies our manuscript "Simplifying Scholarly Abstracts for Accessible Digital Libraries," submitted to JCDL 2024.

Demo

Play with our models reported in the manuscript on Colab.

Results

The generations from each model are stored in the folder eval_results_temp_0.01, in files with descriptive names. We use a temperature of 0.01 for all generations only to avoid potential technical problems in decoding; as described in the manuscript, this is virtually equivalent to a temperature of 0.
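To illustrate why a temperature of 0.01 behaves like a temperature of 0, here is a minimal sketch of temperature-scaled softmax in plain Python. The logits are made up for illustration and the function name is hypothetical; this is not code from the repository. Dividing logits by exactly 0 is undefined, which is one reason a tiny positive value can stand in for greedy decoding.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for `logits` at the given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits, for illustration only.
logits = [2.0, 1.5, 0.3]

probs_default = softmax_with_temperature(logits, 1.0)
probs_near_zero = softmax_with_temperature(logits, 0.01)

# Greedy decoding (the temperature-0 limit) picks the largest logit.
greedy_choice = max(range(len(logits)), key=logits.__getitem__)

print(probs_default)
print(probs_near_zero)
print(greedy_choice)
```

At temperature 0.01 nearly all of the probability mass falls on the token that greedy decoding would pick, so sampling is deterministic in practice.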

Corpus

Due to copyright restrictions, we cannot share the Scientific Abstract-Significance Statement (SASS) corpus publicly. Please feel free to contact us for access to the corpus for academic use. To examine the corpus statistics, you can run the script corpus_stats.py after unzipping the corpus file to the folder resources.

Models

Below are links to the fine-tuned models hosted on the Hugging Face Hub:

OLMo-1B-SFT-SASS
Gemma-2B-SFT-SASS (Note: using Gemma-2B requires permission from Google. See License.)
Phi-2-SFT-SASS (Note: not recommended for practical use because of its performance.)

See our Demo for usage.

Reproduction

Be sure to reproduce our environment first with:

python3.11 -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt

Training

You can use the script sft.py to train your own models. Here is the command we used to train OLMo-1B on the SASS corpus with a single NVIDIA A40 GPU:

python -m sft --model olmo-1b --per_device_train_batch_size 4

See more examples in the runs folder with script names prefixed with sft_.

Evaluation

Download our checkpoints from Zenodo and unzip them into the project folder. A typical checkpoint folder has a name like ckpts/sft_OLMo-1B-hf/checkpoint-940. Then you can use the script eval_outputs.py to rerun the generation and evaluation. Here is an example we used for evaluating Gemma-2B's performance:

python -m eval_outputs --model gemma-2b --temperature 0.01

See more examples in the runs folder with script names prefixed with eval_.
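After unzipping, it can help to confirm which checkpoint folder is present before running the evaluation. The sketch below is a hypothetical helper, not part of the repository; it assumes checkpoint folders follow the checkpoint-<step> naming shown above and returns the one with the highest step number.

```python
import re
from pathlib import Path

# Assumed naming convention: checkpoint-<step>, e.g. checkpoint-940.
CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(run_dir):
    """Return the checkpoint-<step> subfolder with the highest step, or None."""
    run_dir = Path(run_dir)
    if not run_dir.is_dir():
        return None
    candidates = []
    for child in run_dir.iterdir():
        match = CHECKPOINT_PATTERN.match(child.name)
        if child.is_dir() and match:
            # Compare steps as integers so checkpoint-1000 beats checkpoint-940.
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


if __name__ == "__main__":
    # Example path taken from the text above; prints None if nothing is unzipped yet.
    print(latest_checkpoint("ckpts/sft_OLMo-1B-hf"))
```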

Word Accessibility Estimator

We reproduced the model used by Riddell & Igarashi (2021) using the English Wikipedia corpus and provide the trained word accessibility estimator in pickle format at word_freq/wa_model.pkl. The estimator, which handles arbitrary English words, is loaded when eval_outputs.py runs.
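If you want to inspect the estimator outside of eval_outputs.py, a minimal sketch along these lines may help. It assumes the file deserializes with the standard pickle module and that the library that defined the object is importable in the environment; the helper name is hypothetical, and the object's interface is not documented here, so inspect it after loading. Only unpickle files you trust, since unpickling can execute arbitrary code.

```python
import pickle
from pathlib import Path


def load_estimator(path="word_freq/wa_model.pkl"):
    """Deserialize the pickled estimator and return whatever object it holds."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"estimator not found at {path}")
    with path.open("rb") as handle:
        return pickle.load(handle)


if __name__ == "__main__":
    estimator = load_estimator()
    # The interface is an unknown here; print the type to see what was stored.
    print(type(estimator))
```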

To reproduce our word accessibility estimator, run the following scripts in the environment created above:

python -m calculate_token_frequency  # calculate ground truth from wiki_en
python -m estimate_token_frequency  # fit a ridge regression

Zero-shot Performance of OpenAI's Models

The outputs from GPT-3.5/GPT-4o and the logs are hosted in the folder eval_results_temp_0.01. You need to supply your own OPENAI_API_KEY before running the script eval_openai_models.py to reproduce the results.
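A quick pre-flight check can catch a missing key before any API calls are made. The sketch below assumes the key is supplied through the OPENAI_API_KEY environment variable, which is the default lookup of the official OpenAI Python client; how eval_openai_models.py reads the key is not stated here, and the helper name is hypothetical.

```python
import os


def require_openai_api_key(env=None):
    """Return the OpenAI API key from the environment, or raise if it is missing."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running eval_openai_models.py"
        )
    return key


if __name__ == "__main__":
    try:
        require_openai_api_key()
        print("OPENAI_API_KEY found")
    except RuntimeError as err:
        print(err)
```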

License

Our scripts are released under the 0BSD license. OLMo-1B is licensed under Apache-2.0, and Phi-2 is licensed under MIT. Gemma-2B has its own license, so using our fine-tuned Gemma-2B requires permission from Google.

Contact

Haining Wang

Citation

@inproceedings{wang2024simplifying,
  author = {Haining Wang and Jason Clark},
  title = {Simplifying Scholarly Abstracts for Accessible Digital Libraries Using Language Models},
  booktitle = {The 2024 ACM/IEEE Joint Conference on Digital Libraries (JCDL '24)},
  year = {2024},
  location = {Hong Kong, China},
  publisher = {ACM},
  address = {New York, NY, USA},
  pages = {8},
  doi = {10.1145/3677389.3702490}
}
