In natural language processing, pretrained language models have become an essential foundational technology. This repository collects high-quality Chinese pretrained models that are publicly available online (many thanks to everyone who shared these resources) and will be updated continuously.
Note: 🤗 Hugging Face model downloads: 1. Tsinghua University open-source mirror 2. Official site
- 2018 | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Jacob Devlin, et al. | arXiv |
PDF
- 2019 | Pre-Training with Whole Word Masking for Chinese BERT | Yiming Cui, et al. | arXiv |
PDF
Model | Params | Corpus | TensorFlow | PyTorch | Provider | Source | Domain | Notes
---|---|---|---|---|---|---|---|---
BERT-Base | 110M | Chinese Wikipedia (0.4B words) | Google Drive | | Google Research | bert | General |
BERT-wwm | 110M | Chinese Wikipedia (0.4B words) | Google Drive / iFLYTEK Cloud (code: 07Xj) | Google Drive | Yiming Cui | Chinese-BERT-wwm | General |
BERT-wwm-ext | 110M | General corpus (5.4B words) | Google Drive / iFLYTEK Cloud (code: 4cMG) | Google Drive | Yiming Cui | Chinese-BERT-wwm | General |
bert-base-民事 (civil law) | | 26.54M civil case documents | | Aliyun | THUNLP | OpenCLaP | Legal |
bert-base-刑事 (criminal law) | | 6.63M criminal case documents | | Aliyun | THUNLP | OpenCLaP | Legal |
BAAI-JDAI-BERT | | 42G e-commerce customer-service dialogues (9B words) | JD Cloud | | JDAI | pretrained_models_and_embeddings | E-commerce customer service |
FinBERT | | 4M financial-domain documents (3B words) | Google Drive / Baidu Netdisk (code: 1cmp) | Google Drive / Baidu Netdisk (code: 986f) | Value Simplex | FinBERT | Finance |
EduBERT | | 20M education-domain records (0.38B words) | TAL AI | tal-tech | tal-tech | edu-bert | Education |
WoBERT | | 30G general corpus + medical lexicon | Baidu Netdisk (code: kim2) | | natureLanguageQing | Medical_WoBERT | Medical |
MC-BERT | | | Google Drive | | Alibaba AI Research | ChineseBLUE | Medical |
guwenbert-base | | Classical Chinese texts (1.7B words) | | Baidu Netdisk (code: 4jng) / huggingface | Ethan | guwenbert | Classical Chinese |
guwenbert-large | | Classical Chinese texts (1.7B words) | | Baidu Netdisk (code: m5sz) / huggingface | Ethan | guwenbert | Classical Chinese |
Notes:
[1] wwm stands for **Whole Word Masking**: when any WordPiece subword of a word is masked, all remaining subwords of that same word are masked as well.
[2] ext indicates pretraining on extended/additional corpora.
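The whole-word rule in note [1] can be sketched in a few lines. This is a minimal illustration, not the official data pipeline: `subword_fn` stands in for a real WordPiece tokenizer, the input is assumed to be already word-segmented (Chinese-BERT-wwm uses the LTP segmenter), and the 15% default mirrors BERT's masking ratio.

```python
import random

def whole_word_mask(words, subword_fn, mask_ratio=0.15, seed=0):
    """Whole Word Masking sketch: if a word is chosen, mask ALL of its subwords.

    words: list of already-segmented words (e.g. from a Chinese segmenter).
    subword_fn: maps a word to its WordPiece subwords (hypothetical tokenizer).
    Returns (input_tokens, prediction_labels).
    """
    rng = random.Random(seed)
    tokens, labels = [], []
    for word in words:
        pieces = subword_fn(word)
        if rng.random() < mask_ratio:
            # every piece of the chosen word is replaced, not just one
            tokens.extend(["[MASK]"] * len(pieces))
            labels.extend(pieces)  # the model must predict the original pieces
        else:
            tokens.extend(pieces)
            labels.extend([None] * len(pieces))
    return tokens, labels
```

For Chinese, each character is typically its own WordPiece, so masking the word 模型 masks both 模 and 型 together, which is exactly what plain token-level masking fails to guarantee.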
- 2019 | RoBERTa: A Robustly Optimized BERT Pretraining Approach | Yinhan Liu, et al. | arXiv |
PDF
- 2019 | ALBERT: A Lite BERT For Self-Supervised Learning Of Language Representations | Zhenzhong Lan, et al. | arXiv |
PDF
Model | Params | Corpus | TensorFlow | PyTorch | Provider | Source | Domain | Notes
---|---|---|---|---|---|---|---|---
Albert_base_zh | 12M | 30G general corpus | Google Drive | Google Drive | brightmart | albert_zh | General |
Albert_large_zh | | 30G general corpus | Google Drive | Google Drive | brightmart | albert_zh | General |
Albert_xlarge_zh | | 30G general corpus | Google Drive | Google Drive | brightmart | albert_zh | General |
Albert_base | | 30G general corpus | Google Drive | | Google Research | ALBERT | General |
Albert_large | | 30G general corpus | Google Drive | | Google Research | ALBERT | General |
Albert_xlarge | | 30G general corpus | Google Drive | | Google Research | ALBERT | General |
Albert_xxlarge | | 30G general corpus | Google Drive | | Google Research | ALBERT | General |
- 2019 | NEZHA: Neural Contextualized Representation for Chinese Language Understanding | Junqiu Wei, et al. | arXiv |
PDF
Model | Params | Corpus | TensorFlow | PyTorch | Provider | Source | Domain | Notes
---|---|---|---|---|---|---|---|---
NEZHA-base | | | Google Drive / Baidu Netdisk (code: ntn3) | lonePatient | HUAWEI Noah's Ark Lab | link | General |
NEZHA-base-WWM | | | Google Drive / Baidu Netdisk (code: f68o) | lonePatient | HUAWEI Noah's Ark Lab | link | General |
NEZHA-large | | | Google Drive / Baidu Netdisk (code: 7thu) | lonePatient | HUAWEI Noah's Ark Lab | link | General |
NEZHA-large-WWM | | | Google Drive / Baidu Netdisk (code: ni4o) | lonePatient | HUAWEI Noah's Ark Lab | link | General |
NEZHA-Gen | | | Google Drive / Baidu Netdisk (code: ytim) | | HUAWEI Noah's Ark Lab | link | General |
NEZHA-Gen | | | Google Drive / Baidu Netdisk (code: rb5m) | | HUAWEI Noah's Ark Lab | link | |
WoNEZHA | | 30G general corpus + medical lexicon | Baidu Netdisk (code: qgkq) | | natureLanguageQing | link | Medical |
- 2020 | Revisiting Pre-Trained Models for Chinese Natural Language Processing | Yiming Cui, et al. | arXiv |
PDF
Model | Params | Corpus | TensorFlow | PyTorch | Provider | Source | Domain | Notes
---|---|---|---|---|---|---|---|---
MacBERT-base | 102M | General corpus (5.4B words) | Google Drive / iFLYTEK Cloud (code: E2cP) | | Yiming Cui | link | General |
MacBERT-large | 324M | General corpus (5.4B words) | Google Drive / iFLYTEK Cloud (code: 3Yg3) | | Yiming Cui | link | General |
- 2019 | XLNet: Generalized Autoregressive Pretraining for Language Understanding | Zhilin Yang, et al. | arXiv |
PDF
Model | Params | Corpus | TensorFlow | PyTorch | Provider | Source | Domain | Notes
---|---|---|---|---|---|---|---|---
XLNet-base | 117M | General corpus (5.4B words) | Google Drive / iFLYTEK Cloud (code: uCpe) | Google Drive | Yiming Cui | link | General |
XLNet-mid | 209M | General corpus (5.4B words) | Google Drive / iFLYTEK Cloud (code: 68En) | Google Drive | Yiming Cui | link | General |
XLNet_zh_Large | | | Baidu Netdisk | | brightmart | link | General |
- 2020 | ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators | Kevin Clark, et al. | arXiv |
PDF
Model | Params | Corpus | TensorFlow | PyTorch | Provider | Source | Domain | Notes
---|---|---|---|---|---|---|---|---
ELECTRA-180g-large | | | Google Drive / iFLYTEK Cloud (code: Yfcy) | | Yiming Cui | link | General |
ELECTRA-180g-small-ex | | | Google Drive / iFLYTEK Cloud (code: GUdp) | | Yiming Cui | link | General |
ELECTRA-180g-base | | | Google Drive / iFLYTEK Cloud (code: Xcvm) | | Yiming Cui | link | General |
ELECTRA-180g-small | | | Google Drive / iFLYTEK Cloud (code: qsHj) | | Yiming Cui | link | General |
legal-ELECTRA-large | | | Google Drive / iFLYTEK Cloud (code: 7f7b) | | Yiming Cui | link | Legal |
legal-ELECTRA-base | | | Google Drive / iFLYTEK Cloud (code: 7f7b) | | Yiming Cui | link | Legal |
legal-ELECTRA-small | | | Google Drive / iFLYTEK Cloud (code: 7f7b) | | Yiming Cui | link | Legal |
- 2019 | ZEN: Pre-training Chinese Text Encoder Enhanced by N-gram Representations | Shizhe Diao, et al. | arXiv |
PDF
Model | Params | Corpus | TensorFlow | PyTorch | Provider | Source | Domain | Notes
---|---|---|---|---|---|---|---|---
ZEN-Base | | | | Google Drive / Baidu Netdisk | Sinovation Ventures AI Institute | link | General |
- 2019 | ERNIE: Enhanced Representation through Knowledge Integration | Yu Sun, et al. | arXiv |
PDF
- 2020 | SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis | Hao Tian, et al. | arXiv |
PDF
Model | Params | Corpus | PaddlePaddle | PyTorch | Provider | Source | Domain | Notes
---|---|---|---|---|---|---|---|---
ernie-1.0-base | | | link | [nghuyong cloud drive](http://pan.nghuyong.top/#/s/y7Uz) | PaddlePaddle | link | General |
ernie_1.0_skep_large_ch | | | link | | Baidu | link | Sentiment analysis |
- 2018 | Improving Language Understanding by Generative Pre-Training | Alec Radford, et al. | OpenAI |
PDF
- 2019 | Language Models are Unsupervised Multitask Learners | Alec Radford, et al. | OpenAI |
PDF
Model | Params | Corpus | TensorFlow | PyTorch | Provider | Source | Domain | Notes
---|---|---|---|---|---|---|---|---
GPT2 | 1.5B | 30G | Google Drive / Baidu Netdisk (code: ffz6) | | Caspar ZHANG | gpt2-ml | |
GPT2 | 1.5B | 15G | Google Drive / Baidu Netdisk (code: q9vr) | | Caspar ZHANG | gpt2-ml | |
CDial-GPT_LCCC-base | 95.5M | LCCC-base | | [huggingface](https://huggingface.co/thu-coai/CDial-GPT_LCCC-base) | thu-coai | CDial-GPT | |
CDial-GPT2_LCCC-base | 95.5M | LCCC-base | | [huggingface](https://huggingface.co/thu-coai/CDial-GPT2_LCCC-base) | thu-coai | CDial-GPT | |
CDial-GPT_LCCC-large | 95.5M | LCCC-large | | [huggingface](https://huggingface.co/thu-coai/CDial-GPT_LCCC-large) | thu-coai | CDial-GPT | |
GPT2-dialogue | | common Chinese chitchat | | Google Drive / Baidu Netdisk (code: osi6) | yangjianxin1 | GPT2-chitchat | |
GPT2-mmi | | 500K Chinese chitchat dialogues (Baidu Netdisk code: jk8d / Google Drive) | | Google Drive / Baidu Netdisk (code: 1j88) | yangjianxin1 | GPT2-chitchat | |
GPT2-prose | | 130MB prose corpus | | Google Drive / Baidu Netdisk (code: fpyu) | Zeyao Du | GPT2-Chinese | |
GPT2-poetry | | 180MB classical-poetry corpus | | Google Drive / Baidu Netdisk (code: 7fev) | Zeyao Du | GPT2-Chinese | |
GPT2-couplet | | 40MB couplet corpus | | Google Drive / Baidu Netdisk (code: i5n0) | Zeyao Du | GPT2-Chinese | |