Commit a0990c0

dmr model
1 parent c35e90b commit a0990c0

14 files changed, with 1597 additions and 0 deletions
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
mkdir raw_data
cd raw_data
wget https://paddlerec.bj.bcebos.com/datasets/dmr/user_profile.csv.tar.gz
tar -zxvf user_profile.csv.tar.gz
wget https://paddlerec.bj.bcebos.com/datasets/dmr/raw_sample.csv.tar.gz
tar -zxvf raw_sample.csv.tar.gz
wget https://paddlerec.bj.bcebos.com/datasets/dmr/behavior_log.csv.tar.gz
tar -zxvf behavior_log.csv.tar.gz
wget https://paddlerec.bj.bcebos.com/datasets/dmr/ad_feature.csv.tar.gz
tar -zxvf ad_feature.csv.tar.gz
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
# Ali_Display_Ad_Click Dataset
Ali_Display_Ad_Click is a click-through rate (CTR) prediction dataset for Taobao display advertising, provided by Alibaba.

## Raw dataset overview
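- Raw sample skeleton raw_sample: ad impression/click logs (26 million records) of 1.14 million users randomly sampled from the Taobao website over 8 days, which form the raw sample skeleton
    1. user: anonymized user ID;
    2. adgroup_id: anonymized ad unit ID;
    3. time_stamp: timestamp;
    4. pid: ad slot (resource position);
    5. nonclk: 1 means not clicked; 0 means clicked;
    6. clk: 0 means not clicked; 1 means clicked;

```
user,time_stamp,adgroup_id,pid,nonclk,clk
581738,1494137644,1,430548_1007,1,0
```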
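
The raw_sample schema above can be sketched as a small parser. This is only an illustration: the `RawSample` class and `read_raw_samples` function are hypothetical names that do not appear in the commit, and the check that `nonclk` and `clk` are complementary is inferred from the field descriptions rather than taken from any preprocessing code.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class RawSample:
    """One row of raw_sample.csv (hypothetical helper type)."""
    user: int
    time_stamp: int
    adgroup_id: int
    pid: str
    nonclk: int
    clk: int


def read_raw_samples(text):
    """Yield RawSample records from CSV text that starts with a header row."""
    for rec in csv.DictReader(io.StringIO(text)):
        sample = RawSample(
            user=int(rec["user"]),
            time_stamp=int(rec["time_stamp"]),
            adgroup_id=int(rec["adgroup_id"]),
            pid=rec["pid"],  # kept as a string, e.g. "430548_1007"
            nonclk=int(rec["nonclk"]),
            clk=int(rec["clk"]),
        )
        # Per the field descriptions, exactly one of the two flags is 1.
        if sample.nonclk + sample.clk != 1:
            raise ValueError("nonclk and clk must be complementary")
        yield sample


# The sample row shown in the README.
SAMPLE = (
    "user,time_stamp,adgroup_id,pid,nonclk,clk\n"
    "581738,1494137644,1,430548_1007,1,0\n"
)
samples = list(read_raw_samples(SAMPLE))
```

Reading by column name rather than by position avoids depending on the column order, which differs between the field list and the CSV header above.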

- Ad basic information table ad_feature: covers the basic information of all ads in raw_sample
    1. adgroup_id: anonymized ad ID;
    2. cate_id: anonymized item category ID;
    3. campaign_id: anonymized ad campaign ID;
    4. customer: anonymized advertiser ID;
    5. brand: anonymized brand ID;
    6. price: item price
```
adgroup_id,cate_id,campaign_id,customer,brand,price
63133,6406,83237,1,95471,170.0
```

- User basic information table user_profile: covers the basic information of all users in raw_sample
    1. userid: anonymized user ID;
    2. cms_segid: micro-group ID;
    3. cms_group_id: cms_group_id;
    4. final_gender_code: gender, 1: male, 2: female;
    5. age_level: age tier; 1234
    6. pvalue_level: consumption level, 1: low, 2: mid, 3: high;
    7. shopping_level: shopping depth, 1: shallow user, 2: moderate user, 3: deep user
    8. occupation: whether the user is a college student, 1: yes, 0: no
    9. new_user_class_level: city tier
```
userid,cms_segid,cms_group_id,final_gender_code,age_level,pvalue_level,shopping_level,occupation,new_user_class_level
234,0,5,2,5,,3,0,3
```

- User behavior log behavior_log: covers the shopping behavior of all users in raw_sample over 22 days
    1. user: anonymized user ID;
    2. time_stamp: timestamp;
    3. btag: behavior type, one of the following four: (pv: browse), (cart: add to cart), (fav: favorite), (buy: purchase)
    4. cate: anonymized item category ID;
    5. brand: anonymized brand ID;
```
user,time_stamp,btag,cate,brand
558157,1493741625,pv,6250,91286
```
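
## Preprocessed dataset overview
The four files of the raw dataset are merged into a single file, forming a dataset that the reader can read directly.
The dataset has 267 columns in total. raw_sample.csv from the raw dataset is split by timestamp: the first 7 days (20170506-20170512) are used as training samples, and the 8th day (20170513) as test samples. Columns [0:150] hold the 50 history records looked up for each raw_sample.csv row, columns [150:200] are the mask for [50:100], columns [200:250] are the mask for [100:150], columns [250:259] are the features from user_profile.csv, columns [259:265] are the features from ad_feature.csv, and the last column [266] indicates whether the user clicked.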
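
The column layout described above can be sketched as a set of slices. This is a minimal illustration, not code from the commit: `LAYOUT`, `split_row`, and the segment names are hypothetical, and column 265 is left out because the README does not describe it.

```python
NUM_COLUMNS = 267

# Half-open column ranges as stated in the README.
# Column 265 is not described there, so it is not assigned a name here.
LAYOUT = {
    "history": slice(0, 150),                # 50 history records per sample
    "mask_for_50_100": slice(150, 200),      # mask for columns [50:100]
    "mask_for_100_150": slice(200, 250),     # mask for columns [100:150]
    "user_profile": slice(250, 259),         # 9 user_profile.csv features
    "ad_feature": slice(259, 265),           # 6 ad_feature.csv features
    "label": slice(266, 267),                # whether the user clicked
}


def split_row(row):
    """Split one 267-value row into named segments."""
    if len(row) != NUM_COLUMNS:
        raise ValueError(f"expected {NUM_COLUMNS} columns, got {len(row)}")
    return {name: row[cols] for name, cols in LAYOUT.items()}


# Dummy row whose values equal their column index, to make the slicing visible.
row = [float(i) for i in range(NUM_COLUMNS)]
parts = split_row(row)
```

The segment widths match the raw tables: nine values for the nine user_profile fields and six for the six ad_feature fields.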
See the detailed [data processing](https://aistudio.baidu.com/aistudio/projectdetail/1805731) walkthrough here.
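
The time-based train/test split described in this section can be sketched as follows. This is an assumption-laden illustration: `assign_split` is a hypothetical helper, and interpreting the timestamps in UTC+8 (Beijing time) is a guess, since the README gives only calendar dates and no time zone.

```python
from datetime import datetime, timedelta, timezone

# Assumption: day boundaries follow Beijing time (UTC+8).
BEIJING = timezone(timedelta(hours=8))

TRAIN_FIRST_DAY = "20170506"
TRAIN_LAST_DAY = "20170512"
TEST_DAY = "20170513"


def assign_split(time_stamp):
    """Return "train", "test", or None for a Unix timestamp in seconds."""
    day = datetime.fromtimestamp(time_stamp, tz=BEIJING).strftime("%Y%m%d")
    # YYYYMMDD strings of equal length compare in calendar order.
    if TRAIN_FIRST_DAY <= day <= TRAIN_LAST_DAY:
        return "train"
    if day == TEST_DAY:
        return "test"
    return None  # outside the 8-day window


# The raw_sample example row has time_stamp 1494137644 (7 May 2017).
example_split = assign_split(1494137644)
```

With a different time zone the day boundaries shift by a few hours, so samples near midnight could land in the other split.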

datasets/Ali_Display_Ad_Click/run.sh

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
mkdir big_train
mkdir big_test
wget https://paddlerec.bj.bcebos.com/datasets/dmr/dataset_full.zip
unzip dataset_full.zip
mv work/train_sorted.csv big_train/
mv work/test.csv big_test/
rm -rf work

datasets/readme.md

Lines changed: 1 addition & 0 deletions
@@ -37,3 +37,4 @@ sh data_process.sh
|[Netflix](https://paddlerec.bj.bcebos.com/datasets/Netflix/Netflix.zip)|The official dataset used in the Netflix competition.|[Kaggle](https://www.kaggle.com/netflix-inc/netflix-prize-data)|
|[FourSquare](https://paddlerec.bj.bcebos.com/datasets/FourSquare/FourSquare.zip)|Check-ins collected in New York and Tokyo over about 10 months. Each check-in is associated with its timestamp, GPS coordinates, and semantic meaning.|[Kaggle](https://www.kaggle.com/chetanism/foursquare-nyc-and-tokyo-checkin-dataset)|
|[AmazonBook](https://paddlerec.bj.bcebos.com/datasets/AmazonBook/AmazonBook.tar.gz)|The AmazonBook dataset as preprocessed by the paper's original authors |[《Controllable Multi-Interest Framework for Recommendation》](https://arxiv.org/abs/2005.09347)|
|[Ali_Display_Ad_Click](https://paddlerec.bj.bcebos.com/datasets/dmr/dataset_full.zip)|Preprocessed Alimama dataset |[Deep Match to Rank Model for Personalized Click-Through Rate Prediction](https://github.com/lvze92/DMR)|

doc/imgs/dmr.png

264 KB

models/rank/dmr/__init__.py

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

models/rank/dmr/alimama_reader.py

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import print_function
import numpy as np

from paddle.io import IterableDataset


class RecDataset(IterableDataset):
    def __init__(self, file_list, config):
        super(RecDataset, self).__init__()
        self.file_list = file_list

    def __iter__(self):
        for file in self.file_list:
            with open(file, "r") as rf:
                for l in rf:
                    l = l.strip().split(",")
                    l = [
                        '0' if i == '' or i.upper() == 'NULL' else i for i in l
                    ]  # handle missing values
                    output_list = []
                    output_list.append(np.array(l).astype('float32'))
                    yield output_list
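
The missing-value handling in the reader above can be tried in isolation with a small sketch. `parse_line` is a hypothetical stand-alone function, not part of the commit, and it returns a plain list of Python floats instead of a float32 NumPy array so that it runs without NumPy or Paddle.

```python
def parse_line(line):
    """Parse one comma-separated line the way the reader does.

    Empty fields and the literal NULL (in any letter case) become 0.0.
    """
    fields = line.strip().split(",")
    fields = ["0" if f == "" or f.upper() == "NULL" else f for f in fields]
    return [float(f) for f in fields]


# The user_profile sample row from the dataset README has an empty
# pvalue_level field (the sixth value).
profile_values = parse_line("234,0,5,2,5,,3,0,3\n")

# NULL markers are matched case-insensitively.
null_values = parse_line("1,null,NULL,2")
```

Replacing a missing value with 0 means that 0 doubles as the "unknown" ID for that feature, which is worth keeping in mind when reading the feature sizes in the config files.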

models/rank/dmr/config.yaml

Lines changed: 66 additions & 0 deletions
@@ -0,0 +1,66 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# global settings

runner:
  train_data_dir: "data/sample_data"
  train_reader_path: "alimama_reader"  # importlib format
  use_gpu: False
  use_auc: True
  train_batch_size: 100
  epochs: 1
  print_interval: 1
  # model_init_path: "output_model_dmr/0" # init model
  model_save_path: "output_model_dmr"
  test_data_dir: "data/sample_data"
  infer_reader_path: "alimama_reader"  # importlib format
  infer_batch_size: 256
  infer_load_path: "output_model_dmr"
  infer_start_epoch: 0
  infer_end_epoch: 1

# hyper parameters of user-defined network
hyper_parameters:
  # optimizer config
  optimizer:
    class: Adam
    learning_rate: 0.008
    strategy: async
  # user-defined <key, value> pairs
  # user feature size
  user_size: 1141730
  cms_segid_size: 97
  cms_group_id_size: 13
  final_gender_code_size: 3
  age_level_size: 7
  pvalue_level_size: 4
  shopping_level_size: 4
  occupation_size: 3
  new_user_class_level_size: 5

  # item feature size
  adgroup_id_size: 846812
  cate_size: 12978
  campaign_id_size: 423437
  customer_size: 255876
  brand_size: 461529

  # context feature size
  btag_size: 5
  pid_size: 2

  # embedding size
  main_embedding_size: 32
  other_embedding_size: 8
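
As a rough illustration of how the `infer_load_path`, `infer_start_epoch`, and `infer_end_epoch` settings above might fit together, the sketch below lists the checkpoint directories an inference run would visit. Two things here are assumptions, not facts confirmed by this commit: that checkpoints live in one subdirectory per epoch (suggested by the commented-out `model_init_path: "output_model_dmr/0"`), and that the epoch range is half-open. `checkpoint_dirs` is a hypothetical helper.

```python
import os


def checkpoint_dirs(infer_load_path, infer_start_epoch, infer_end_epoch):
    """List per-epoch checkpoint directories for the range [start, end).

    Assumes one subdirectory per epoch, named after the epoch index.
    """
    return [
        os.path.join(infer_load_path, str(epoch))
        for epoch in range(infer_start_epoch, infer_end_epoch)
    ]


# Values taken from the runner section of config.yaml.
dirs = checkpoint_dirs("output_model_dmr", 0, 1)
```

Under these assumptions, the sample configuration evaluates only the checkpoint written after the single training epoch.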

models/rank/dmr/config_bigdata.yaml

Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,65 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# global settings

runner:
  train_data_dir: "../../../datasets/Ali_Display_Ad_Click/big_train"
  train_reader_path: "alimama_reader"  # importlib format
  use_gpu: True
  use_auc: True
  train_batch_size: 5120
  epochs: 1
  print_interval: 100

  model_save_path: "output_model_all_dmr"
  test_data_dir: "../../../datasets/Ali_Display_Ad_Click/big_test"
  infer_reader_path: "alimama_reader"  # importlib format
  infer_batch_size: 5120
  infer_load_path: "output_model_all_dmr"
  infer_start_epoch: 0
  infer_end_epoch: 1

# hyper parameters of user-defined network
hyper_parameters:
  # optimizer config
  optimizer:
    class: Adam
    learning_rate: 0.008
  # user-defined <key, value> pairs
  # user feature size
  user_size: 1141730
  cms_segid_size: 97
  cms_group_id_size: 13
  final_gender_code_size: 3
  age_level_size: 7
  pvalue_level_size: 4
  shopping_level_size: 4
  occupation_size: 3
  new_user_class_level_size: 5

  # item feature size
  adgroup_id_size: 846812
  cate_size: 12978
  campaign_id_size: 423437
  customer_size: 255876
  brand_size: 461529

  # context feature size
  btag_size: 5
  pid_size: 2

  # embedding size
  main_embedding_size: 32
  other_embedding_size: 8
