WangChanGLM 🐘 - The Multilingual Instruction-Following Model

Blog | Codes | Demo

WangChanGLM is a multilingual, instruction-finetuned model based on Facebook's XGLM-7.5B, trained on open-source, commercially permissible datasets (LAION OIG chip2 and infill_dbpedia, DataBricks Dolly v2, OpenAI TL;DR, and Hello-SimpleAI HC3; about 400k examples) and released under CC-BY-SA 4.0. The models are trained to perform a subset of instruction-following tasks we found most relevant, namely reading comprehension, brainstorming, and creative writing. We provide the weights for a model finetuned on an English-only dataset (wangchanglm-7.5B-sft-en) and another checkpoint further finetuned on a Google-Translated Thai dataset (wangchanglm-7.5B-sft-enth). We perform Vicuna-style evaluation using both humans and ChatGPT (in our case, gpt-3.5-turbo, since we are still on the waitlist for gpt-4) and observe some discrepancies between the two types of annotators. All training and evaluation code is shared under the Apache-2.0 license on our Github, as are the datasets and model weights on HuggingFace. In a similar manner to Dolly v2, we use only open-source, commercially permissive pretrained models and datasets; our models are restricted neither by a non-commercial clause, like models that use LLaMA as a base, nor by a non-compete clause, like models that use self-instruct datasets from ChatGPT. See our live demo here.

Models

We provide the following versions of our models:

- pythainlp/wangchanglm-7.5B-sft-en: finetuned on the English-only dataset
- pythainlp/wangchanglm-7.5B-sft-enth: further finetuned on a Google-Translated Thai dataset

Sharded versions of these checkpoints, used in the demo, are also available on our HuggingFace organization.

Training Sets

We provide our training set as follows:

- pythainlp/final_training_set_v1: the combined instruction-following dataset used in the finetuning commands below
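To inspect the data, here is a minimal sketch using the datasets library (loading the default train split is an assumption):

from datasets import load_dataset

# Load the combined instruction-following training set from HuggingFace.
dataset = load_dataset("pythainlp/final_training_set_v1", split="train")
print(dataset)     # number of rows and column names
print(dataset[0])  # one example record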

Finetuning

Multi-world LoRA

We finetuned XGLM-7.5B on 4 V100 GPUs (32GB VRAM each) with the hyperparameters described in script/train_sft_peft_multi_world.py.

# effective batch size = 128 (4 GPUs * 1 per-device batch size * 32 gradient accumulation steps)
python -m torch.distributed.launch --nproc_per_node=4 train_sft_peft_multi_world.py \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --wandb_project your_project_name \
    --model_name facebook/xglm-7.5B \
    --dataset_name pythainlp/final_training_set_v1 \
    --adapter_name save_adapter_to
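For orientation, here is a minimal sketch of the LoRA setup such a script performs with peft; the rank, alpha, dropout, and target modules below are illustrative assumptions, and the actual hyperparameters live in script/train_sft_peft_multi_world.py:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative LoRA hyperparameters; see the training script for the real values.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections for XGLM
    task_type="CAUSAL_LM",
)
model = AutoModelForCausalLM.from_pretrained("facebook/xglm-7.5B")
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the adapter weights are trainable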

The adapter is merged into the main weights with the script from lvwerra/trl.
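For reference, a minimal sketch of the merge step using the peft API (the adapter path and output directory are placeholders, not the script's exact arguments):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("facebook/xglm-7.5B")
model = PeftModel.from_pretrained(base, "save_adapter_to")  # adapter dir from training
merged = model.merge_and_unload()  # folds the LoRA weights into the base model
merged.save_pretrained("wangchanglm-merged")
AutoTokenizer.from_pretrained("facebook/xglm-7.5B").save_pretrained("wangchanglm-merged")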

Single-world LoRA

It is possible to finetune XGLM-7.5B on a single 32GB-VRAM GPU, or on multiple GPUs with smaller VRAM, with the hyperparameters described in script/train_sft_peft_single_world.py.

# effective batch size = 128 (1 GPU * 2 per-device batch size * 64 gradient accumulation steps)
python train_sft_peft_single_world.py \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 64 \
    --wandb_project your_project_name \
    --model_name facebook/xglm-7.5B \
    --dataset_name pythainlp/final_training_set_v1 \
    --adapter_name save_adapter_to

Full-finetuning

We also provide a script for full finetuning, which we experimented with on a smaller model and a different set of training data.

python -m torch.distributed.launch --nproc_per_node=8 train_sft.py \
--per_device_train_batch_size=8 --per_device_eval_batch_size=8 --gradient_accumulation_steps=16 \
--model_name=facebook/xglm-1.7B --bf16 --deepspeed=../config/sft_deepspeed_config.json

Inference

We performed inference on the OpenAssistant prompts using the hyperparameters described in script/generate_huggingface_answer.py.

python generate_huggingface_answer.py \
    --input_fname ../data/oasst1_gpt35turbo_answer.csv \
    --model_name pythainlp/wangchanglm-7.5B-sft-en \
    --tokenizer_name pythainlp/wangchanglm-7.5B-sft-en \
    --output_fname ../data/oasst1_wangchang_sft_en_only_answer_answer.csv
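For quick interactive use, here is a minimal generation sketch with transformers; the prompt template and sampling settings are illustrative assumptions rather than the script's exact configuration:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "pythainlp/wangchanglm-7.5B-sft-en"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Illustrative instruction-style prompt; see the script for the exact template.
prompt = "<human>: Summarize the benefits of instruction finetuning.\n<bot>: "
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))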

Evaluation

We evaluated pairs of model answers using gpt-3.5-turbo as described in script/eval_vicuna_style.py. The entire inference and evaluation pipeline is in script/infer_and_eval.sh. The human questionnaires are stored in data/human_questionnaire.
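As a rough sketch of the Vicuna-style pairwise judging the script performs (the judging prompt below is a simplified stand-in, not the exact one in script/eval_vicuna_style.py, and it uses the current openai client rather than the 2023-era API):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(question: str, answer_a: str, answer_b: str) -> str:
    # Simplified stand-in for the Vicuna-style judging prompt.
    prompt = (
        f"Question: {question}\n\n"
        f"Assistant A: {answer_a}\n\n"
        f"Assistant B: {answer_b}\n\n"
        "Which answer is better, A or B? Explain briefly, then give a verdict."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content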

Environmental Impact

Experiments were conducted using private infrastructure with a carbon efficiency of 0.432 kgCO2eq/kWh. A cumulative 500 hours of computation was performed on Tesla V100-SXM2-32GB hardware (TDP of 300W). Total emissions are estimated to be 64.8 kgCO2eq, of which 0 percent was directly offset. Estimations were conducted using the Machine Learning Impact calculator presented in lacoste2019quantifying.
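For reference, the estimate follows directly from these figures: 500 h × 0.3 kW = 150 kWh, and 150 kWh × 0.432 kgCO2eq/kWh = 64.8 kgCO2eq.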

BibTeX

@software{charin_polpanumas_2023_7878101,
  author       = {Charin Polpanumas and
                  Wannaphong Phatthiyaphaibun and
                  Patomporn Payoungkhamdee and
                  Peerat Limkonchotiwat and
                  Lalita Lowphansirikul and
                  Can Udomcharoenchaikit and
                  Titipat Achakulwisut and
                  Ekapol Chuangsuwanich and
                  Sarana Nutanong},
  title        = {{WangChanGLM🐘 — The Multilingual Instruction-Following Model}},
  month        = apr,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {v0.1},
  doi          = {10.5281/zenodo.7878101},
  url          = {https://doi.org/10.5281/zenodo.7878101}
}

Acknowledgements

We would like to thank HuggingFace for the open-source infrastructure and ecosystem they have built, especially lvwerra of the trl repository. We also give our appreciation to the open-source finetuning pioneers that came before us, including but not limited to Alpaca, Alpaca-LoRA, GPT4All, OpenAssistant, Koala, Vicuna, and Dolly.

License

The source code is licensed under the Apache-2.0 license. The model weights are licensed under CC-BY-SA 4.0. Finetuning datasets are sourced from LAION OIG chip2 and infill_dbpedia (Apache-2.0), DataBricks Dolly v2 (Apache-2.0), OpenAI TL;DR (MIT), and Hello-SimpleAI HC3 (CC-BY-SA).