Commit 7201d03

202105 version
202105 version commit
1 parent b5bd2fa commit 7201d03

14 files changed, with 124,203 additions and 36 deletions

README-zh_CN.md

+48 -9
@@ -15,7 +15,7 @@ CaCl2 is an important component of the open project CaOCl (CA Open Chinese Lexical Analysis Toolkit)
 
 | Date | Total entries | Candidate entries | Published entries | Preview entries |
 | :----: | :----: | :----: | :----: | :----: |
-| 2021-04-01 | approx. 21,000,000 | approx. 3,000,000 | 2,624,625 | 280,000 |
+| 2021-04-01 | approx. 21,000,000 | approx. 3,000,000 | 3,279,518 | 280,000 |
 
 #### 2. Number of industry dictionaries
 | Date | Industry | Dictionaries | Published | Preview | Unpublished |
@@ -65,11 +65,11 @@ jieba.load_userdict(os.path.join(BASE_PATH_TO_DICT), dict_name))
 ### 1. Open-sourced
 | Industry code | Dictionary name | Entries | Release date | Current version | Format | Download |
 | :----: | :---- | :----: | :----: | :----: | :----: | :----: |
-| 480000 | Banking - General | 40,612 | 2021-02 | v0.2 | txt | [480000.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/480000.zip) |
-| 480100 | Banking - Banks | 224,433 | 2021-02 | v0.2 | txt | [480100.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/480100.zip) |
-| 490000 | Non-bank Finance - General | 353,149 | 2021-02 | v0.2 | txt | [490000.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/490000.zip) |
-| 490100 | Non-bank Finance - Securities | 324,450 | 2021-02 | v0.2 | txt | [490100.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/490100.zip) |
-| 490200 | Non-bank Finance - Insurance | 31,020 | 2021-02 | v0.2 | txt | [480200.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/480200.zip) |
+| 480000 | Banking - General | 52,105 | 2021-02 | v0.2 | txt | [480000.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/480000.zip) |
+| 480100 | Banking - Banks | 232,434 | 2021-02 | v0.2 | txt | [480100.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/480100.zip) |
+| 490000 | Non-bank Finance - General | 365,878 | 2021-02 | v0.2 | txt | [490000.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/490000.zip) |
+| 490100 | Non-bank Finance - Securities | 338,428 | 2021-02 | v0.2 | txt | [490100.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/490100.zip) |
+| 490200 | Non-bank Finance - Insurance | 45,388 | 2021-02 | v0.2 | txt | [480200.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/480200.zip) |
 
 ### 2. Planned for open-sourcing
 | Industry code | Dictionary name | Entries | Planned release date | Current version | Format | Download |
@@ -188,14 +188,53 @@ A股 今日 迎来 4 月 开门红 三大 指数 集体 收涨 其中
 ### 2. Metrics and scores
 #### 2.1 Industry dataset tests
 ##### 2.1.1 Financial industry (banking): segmentation test
-![Financial industry (banking): segmentation test]()
+###### Segmentation with the CaCl2 banking dictionary (code example)
+```python
+import jieba
+dict_name = '480100.txt'  # CaCl2 banking dictionary (txt format)
+jieba.load_userdict(dict_name)  # register the entries as a jieba user dictionary
+seg_list = jieba.cut(text, cut_all=False)  # `text` is the input string; cut_all=False selects precise mode
+print("cacl2: " + "/ ".join(seg_list))
+```
+![Financial industry (banking): segmentation test](https://github.com/limccn/cacl2/blob/master/docs/images/480100.png)
+
+[Detailed segmentation test results](https://github.com/limccn/cacl2/docs/480100_cacl2_seg.txt)
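 ##### 2.1.2 Financial industry (excluding banking): segmentation test
-![Financial industry (excluding banking): segmentation test]()
+###### Segmentation with the CaCl2 standard financial dictionary (code example)
+```python
+import jieba
+dict_name = '490000.txt'  # CaCl2 non-bank finance dictionary (txt format)
+jieba.load_userdict(dict_name)  # register the entries as a jieba user dictionary
+seg_list = jieba.cut(text, cut_all=False)  # `text` is the input string; cut_all=False selects precise mode
+print("cacl2: " + "/ ".join(seg_list))
+```
+![Financial industry (excluding banking): segmentation test](https://github.com/limccn/cacl2/blob/master/docs/images/490000.png)
+
+[Detailed segmentation test results](https://github.com/limccn/cacl2/docs/490000_cacl2_seg.txt)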
 #### 2.2 Standard dataset tests
 ##### 2.2.1 Segmentation test on the standard dataset Chinese Treebank (CTB5), [reference link](https://www.cs.brandeis.edu/~clp/ctb/)
 ![Segmentation test on the standard dataset CTB5]()
 ##### 2.2.2 Segmentation test on the standard dataset International Chinese Word Segmentation Bakeoff (ICWB2), [reference link](http://sighan.cs.uchicago.edu/bakeoff2005/)
-![Segmentation test on the standard dataset ICWB2]()
+Scoring results of the segmentation test on the ICWB2 standard dataset:
+```
+=== SUMMARY:
+=== TOTAL INSERTIONS: 1796
+=== TOTAL DELETIONS: 10090
+=== TOTAL SUBSTITUTIONS: 12567
+=== TOTAL NCHANGE: 24453
+=== TOTAL TRUE WORD COUNT: 104372
+=== TOTAL TEST WORD COUNT: 96078
+=== TOTAL TRUE WORDS RECALL: 0.783
+=== TOTAL TEST WORDS PRECISION: 0.851
+=== F MEASURE: 0.815
+=== OOV Rate: 0.058
+=== OOV Recall Rate: 0.582
+=== IV Recall Rate: 0.795
+### pku_cacl2_seg.txt 1796 10090 12567 24453 104372 96078 0.783 0.851 0.815 0.058 0.582 0.795
+```
+![Segmentation test on the standard dataset ICWB2](https://github.com/limccn/cacl2/blob/master/docs/images/score.png)
+
+[Detailed scoring results](https://github.com/limccn/cacl2/docs/score.txt)
 
 ## 5. History and changelog
 ### 1. Regular releases
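
The ICWB2 score summary added in this README diff can be cross-checked from its own counts. The sketch below is not the scoring script itself; it assumes that a gold-standard word counts as correct unless it was deleted or substituted, and it uses only the totals quoted in the summary. The variable names are illustrative.

```python
# Totals copied from the ICWB2 score summary in the README diff.
insertions = 1796
deletions = 10090
substitutions = 12567
true_words = 104372  # words in the gold-standard segmentation
test_words = 96078   # words produced by the segmenter

# Assumption: every gold word that was neither deleted nor substituted is correct.
correct = true_words - deletions - substitutions

recall = correct / true_words
precision = correct / test_words
f_measure = 2 * precision * recall / (precision + recall)

print(f"NCHANGE   = {insertions + deletions + substitutions}")
print(f"recall    = {recall:.3f}")
print(f"precision = {precision:.3f}")
print(f"F measure = {f_measure:.3f}")
```

Under that assumption the printed values agree with the summary lines (NCHANGE 24453, recall 0.783, precision 0.851, F measure 0.815), and the same `correct` count also follows from the test side as `test_words - insertions - substitutions`.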

STATUES-zh_CN.md

+28 -27
@@ -6,6 +6,7 @@
 
 | Date | Total entries | Candidate entries | Published entries | Preview entries |
 | :----: | :----: | :----: | :----: | :----: |
+| 2021-04-01 | approx. 21,000,000 | approx. 3,000,000 | 3,279,518 | 280,000 |
 | 2021-03-01 | approx. 21,000,000 | approx. 3,000,000 | 2,624,625 | 280,000 |
 | 2021-02-01 | approx. 21,000,000 | approx. 3,000,000 | 2,553,806 | 280,000 |
 #### Number of industry dictionaries
@@ -21,38 +22,38 @@
 | :----: | :---- | :----: | :----: | :----: | :----: | :----: | :----: |
 | 110000 | Agriculture, Forestry, Animal Husbandry and Fishery - General | 81,974 | Preview | - | v0.1 | txt | [110000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/110000.zip) |
 | 210000 | Mining - General | 21,060 | Preview | - | v0.1 | txt | [210000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/210000.zip) |
-| 220000 | Chemicals - General | 39,263 | Preview | - | v0.1 | txt | [220000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/220000.zip) |
+| 220000 | Chemicals - General | 43,691 | Preview | - | v0.1 | txt | [220000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/220000.zip) |
 | 230000 | Steel - General | 10,904 | Preview | - | v0.1 | txt | [230000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/230000.zip) |
-| 240000 | Non-ferrous Metals - General | 13,111 | Preview | - | v0.1 | txt | [240000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/240000.zip) |
-| 270000 | Electronics - General | 117,379 | Preview | - | v0.1 | txt | [270000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/270000.zip) |
-| 280000 | Automobiles - General | 65,041 | Preview | - | v0.1 | txt | [280000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/280000.zip) |
-| 330000 | Household Appliances - General | 23,578 | Preview | - | v0.1 | txt | [330000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/330000.zip) |
+| 240000 | Non-ferrous Metals - General | 16,224 | Preview | - | v0.1 | txt | [240000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/240000.zip) |
+| 270000 | Electronics - General | 148,302 | Preview | - | v0.1 | txt | [270000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/270000.zip) |
+| 280000 | Automobiles - General | 73,556 | Preview | - | v0.1 | txt | [280000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/280000.zip) |
+| 330000 | Household Appliances - General | 30,231 | Preview | - | v0.1 | txt | [330000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/330000.zip) |
 | 340000 | Food and Beverage - General | 26,859 | Preview | - | v0.1 | txt | [340000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/340000.zip) |
-| 350000 | Textiles and Apparel - General | 28,230 | Preview | - | v0.1 | txt | [350000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/350000.zip) |
-| 360000 | Light Manufacturing - General | 15,602 | Preview | - | v0.1 | txt | [360000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/360000.zip) |
-| 370000 | Pharmaceuticals and Biotechnology - General | 25,676 | Preview | - | v0.1 | txt | [370000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/370000.zip) |
-| 410000 | Utilities - General | 97,029 | Preview | - | v0.1 | txt | [410000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/410000.zip) |
-| 420000 | Transportation - General | 61,679 | Preview | - | v0.1 | txt | [420000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/420000.zip) |
-| 430000 | Real Estate - General | 91,992 | Preview | - | v0.1 | txt | [430000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/430000.zip) |
-| 450000 | Commerce and Trade - General | 166,500 | Preview | - | v0.1 | txt | [450000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/450000.zip) |
-| 460000 | Leisure Services - General | 165,955 | Preview | - | v0.1 | txt | [460000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/460000.zip) |
-| 480000 | Banking - General | 40,612 | Released | 2020-02 | v0.2 | txt | [480000.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/480000.zip) |
-| 490000 | Non-bank Finance - General | 353,149 | Released | 2020-02 | v0.2 | txt | [490000.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/490000.zip) |
-| 510000 | Conglomerates - General | 23,731 | Preview | - | v0.1 | txt | [510000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/510000.zip) |
-| 610000 | Building Materials - General | 18,426 | Preview | - | v0.1 | txt | [610000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/610000.zip) |
-| 620000 | Construction and Decoration - General | 40,244 | Preview | - | v0.1 | txt | [620000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/620000.zip) |
-| 630000 | Electrical Equipment - General | 35,683 | Preview | - | v0.1 | txt | [630000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/630000.zip) |
-| 640000 | Machinery - General | 127,805 | Preview | - | v0.1 | txt | [640000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/640000.zip) |
-| 650000 | National Defense - General | 23,940 | Preview | - | v0.1 | txt | [650000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/650000.zip) |
-| 710000 | Computers - General | 76,148 | Preview | - | v0.1 | txt | [710000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/710000.zip) |
-| 720000 | Media - General | 66,374 | Preview | - | v0.1 | txt | [720000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/720000.zip) |
-| 730000 | Telecommunications - General | 41,542 | Preview | - | v0.1 | txt | [730000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/730000.zip) |
+| 350000 | Textiles and Apparel - General | 33,728 | Preview | - | v0.1 | txt | [350000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/350000.zip) |
+| 360000 | Light Manufacturing - General | 46,532 | Preview | - | v0.1 | txt | [360000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/360000.zip) |
+| 370000 | Pharmaceuticals and Biotechnology - General | 33,171 | Preview | - | v0.1 | txt | [370000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/370000.zip) |
+| 410000 | Utilities - General | 106,506 | Preview | - | v0.1 | txt | [410000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/410000.zip) |
+| 420000 | Transportation - General | 64,030 | Preview | - | v0.1 | txt | [420000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/420000.zip) |
+| 430000 | Real Estate - General | 108,396 | Preview | - | v0.1 | txt | [430000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/430000.zip) |
+| 450000 | Commerce and Trade - General | 240,736 | Preview | - | v0.1 | txt | [450000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/450000.zip) |
+| 460000 | Leisure Services - General | 201,698 | Preview | - | v0.1 | txt | [460000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/460000.zip) |
+| 480000 | Banking - General | 52,105 | Released | 2020-02 | v0.2 | txt | [480000.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/480000.zip) |
+| 490000 | Non-bank Finance - General | 365,878 | Released | 2020-02 | v0.2 | txt | [490000.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/490000.zip) |
+| 510000 | Conglomerates - General | 183,582 | Preview | - | v0.1 | txt | [510000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/510000.zip) |
+| 610000 | Building Materials - General | 53,549 | Preview | - | v0.1 | txt | [610000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/610000.zip) |
+| 620000 | Construction and Decoration - General | 60,544 | Preview | - | v0.1 | txt | [620000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/620000.zip) |
+| 630000 | Electrical Equipment - General | 73,085 | Preview | - | v0.1 | txt | [630000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/630000.zip) |
+| 640000 | Machinery - General | 223,578 | Preview | - | v0.1 | txt | [640000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/640000.zip) |
+| 650000 | National Defense - General | 36,563 | Preview | - | v0.1 | txt | [650000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/650000.zip) |
+| 710000 | Computers - General | 117,364 | Preview | - | v0.1 | txt | [710000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/710000.zip) |
+| 720000 | Media - General | 117,009 | Preview | - | v0.1 | txt | [720000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/720000.zip) |
+| 730000 | Telecommunications - General | 48,606 | Preview | - | v0.1 | txt | [730000.zip](https://github.com/limccn/cacl2/blob/master/archive/preview/730000.zip) |
 
 ### Second-level industry dictionaries
 | Industry code | First-level industry | Dictionary name | Entries | Status | Release date | Current version | Format | Download |
 | :----: |:---- | :---- | :----: | :----: | :----: | :----: | :----: | :----: |
 | 480000 | Banking| | | | | | | |
-| 480100 | | Banking - Banks | 224,433 | Released | 2021-02 | v0.2 | txt | [480100.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/480100.zip) |
+| 480100 | | Banking - Banks | 232,434 | Released | 2021-02 | v0.2 | txt | [480100.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/480100.zip) |
 | 490000 | Non-bank Finance| | | | | | | |
-| 490100 | | Non-bank Finance - Securities | 324,450 | Released | 2021-02 | v0.2 | txt | [490100.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/490100.zip) |
-| 490200 | | Non-bank Finance - Insurance | 31,020 | Released | 2021-02 | v0.2 | txt | [480200.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/480200.zip) |
+| 490100 | | Non-bank Finance - Securities | 338,428 | Released | 2021-02 | v0.2 | txt | [490100.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/490100.zip) |
+| 490200 | | Non-bank Finance - Insurance | 45,388 | Released | 2021-02 | v0.2 | txt | [480200.zip](https://github.com/limccn/cacl2/blob/master/archive/v0.2/480200.zip) |
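
Most changes in the two tables are updated entry counts, so it can help to compute the growth per dictionary from a removed/added row pair. The following is a minimal sketch: the rows are abbreviated to their first three cells (first-level table layout), the names are shortened for illustration, and `cell`, `entry_count`, and `ROW_PAIRS` are hypothetical helpers, not part of the repository.

```python
def cell(row: str, index: int) -> str:
    """Return the stripped text of one cell from a Markdown table row."""
    return row.strip().strip("|").split("|")[index].strip()


def entry_count(row: str, index: int = 2) -> int:
    """Parse the entry-count cell, dropping thousands separators."""
    return int(cell(row, index).replace(",", ""))


# (removed row, added row) pairs, abbreviated from the first-level table in the diff.
ROW_PAIRS = [
    ("| 480000 | Banking - General | 40,612 |", "| 480000 | Banking - General | 52,105 |"),
    ("| 490000 | Non-bank Finance - General | 353,149 |", "| 490000 | Non-bank Finance - General | 365,878 |"),
]

# Map each industry code to the number of entries gained in this commit.
growth = {
    cell(new, 0): entry_count(new) - entry_count(old)
    for old, new in ROW_PAIRS
}

for code, delta in growth.items():
    print(f"{code}: +{delta:,} entries")
```

For the second-level table the count sits one column further right, so `entry_count(row, index=3)` would be the matching call there.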

archive/v0.2/480000.zip

108 KB
Binary file not shown.

archive/v0.2/480100.zip

1.05 MB
Binary file not shown.

archive/v0.2/490000.zip

1.75 MB
Binary file not shown.

archive/v0.2/490100.zip

1.62 MB
Binary file not shown.

archive/v0.2/490200.zip

230 KB
Binary file not shown.

archive/v0.2/490300.zip

689 KB
Binary file not shown.
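
The archives above are ordinary zip files, and the README loads the extracted `.txt` file with `jieba.load_userdict`. The sketch below shows one way to unpack an archive and count its entries before loading it. It assumes one entry per non-blank line, which this commit does not confirm; the helper names are hypothetical, and the sample archive is built locally with made-up entries so the example runs offline. `jieba` is a third-party package and is imported only inside `load_into_jieba`.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def extract_dictionary(zip_path: Path, dest_dir: Path) -> List[Path]:
    """Unpack an archive and return the extracted .txt dictionary files."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
    return sorted(dest_dir.rglob("*.txt"))


def count_entries(dict_path: Path) -> int:
    """Count non-blank lines, assuming one dictionary entry per line."""
    with open(dict_path, encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def load_into_jieba(dict_path: Path) -> None:
    """Register the dictionary with jieba (requires the jieba package)."""
    import jieba  # imported lazily so the helpers above need only the stdlib

    jieba.load_userdict(str(dict_path))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tmp_dir = Path(tmp)
        # Stand-in archive with made-up entries; a real run would point at a
        # downloaded file such as 480100.zip instead.
        sample_zip = tmp_dir / "sample.zip"
        with zipfile.ZipFile(sample_zip, "w") as archive:
            archive.writestr("sample.txt", "存款准备金率\n同业拆借\n\n")
        for path in extract_dictionary(sample_zip, tmp_dir / "dict"):
            print(path.name, count_entries(path))
```

Comparing the printed count with the entry count listed in the tables is a quick sanity check that a download was complete, provided the one-entry-per-line assumption holds for the file in question.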
