dmr model
yinhaofeng committed Jun 29, 2021
1 parent c35e90b commit a0990c0
Showing 14 changed files with 1,597 additions and 0 deletions.
10 changes: 10 additions & 0 deletions datasets/Ali_Display_Ad_Click/get_data.sh
@@ -0,0 +1,10 @@
mkdir raw_data
cd raw_data
wget https://paddlerec.bj.bcebos.com/datasets/dmr/user_profile.csv.tar.gz
tar -zxvf user_profile.csv.tar.gz
wget https://paddlerec.bj.bcebos.com/datasets/dmr/raw_sample.csv.tar.gz
tar -zxvf raw_sample.csv.tar.gz
wget https://paddlerec.bj.bcebos.com/datasets/dmr/behavior_log.csv.tar.gz
tar -zxvf behavior_log.csv.tar.gz
wget https://paddlerec.bj.bcebos.com/datasets/dmr/ad_feature.csv.tar.gz
tar -zxvf ad_feature.csv.tar.gz
59 changes: 59 additions & 0 deletions datasets/Ali_Display_Ad_Click/readme.md
@@ -0,0 +1,59 @@
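# Ali_Display_Ad_Click dataset
Ali_Display_Ad_Click is a click-through rate (CTR) prediction dataset for Taobao display advertising, provided by Alibaba.

## Raw dataset
- raw_sample (raw sample skeleton): ad impression/click logs over 8 days for 1.14 million users randomly sampled from the Taobao website (26 million records), which form the raw sample skeleton
1. user: anonymized user ID;
2. adgroup_id: anonymized ad unit ID;
3. time_stamp: timestamp;
4. pid: ad slot (resource position);
5. nonclk: 1 means not clicked; 0 means clicked;
6. clk: 0 means not clicked; 1 means clicked;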

```
user,time_stamp,adgroup_id,pid,nonclk,clk
581738,1494137644,1,430548_1007,1,0
```
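
As a rough illustration of the schema above, the sample row can be parsed with Python's standard `csv` module. The `parse_raw_sample` helper is hypothetical and not part of this commit; it assumes only the header and field meanings listed above.

```python
import csv
import io


def parse_raw_sample(text):
    """Parse raw_sample CSV text into a list of typed dicts (hypothetical helper)."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text)):
        row = {
            "user": int(rec["user"]),
            "time_stamp": int(rec["time_stamp"]),
            "adgroup_id": int(rec["adgroup_id"]),
            "pid": rec["pid"],  # slot id stays a string, e.g. "430548_1007"
            "nonclk": int(rec["nonclk"]),
            "clk": int(rec["clk"]),
        }
        # The README defines nonclk and clk as complements of each other.
        if row["clk"] + row["nonclk"] != 1:
            raise ValueError("inconsistent click flags: %r" % (rec,))
        rows.append(row)
    return rows


sample = (
    "user,time_stamp,adgroup_id,pid,nonclk,clk\n"
    "581738,1494137644,1,430548_1007,1,0\n"
)
rows = parse_raw_sample(sample)
print(rows[0])
```

Reading by column name rather than position keeps the sketch independent of the field order in the file.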

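- ad_feature (ad basic information table): covers the basic information of all ads in raw_sample
1. adgroup_id: anonymized ad ID;
2. cate_id: anonymized product category ID;
3. campaign_id: anonymized ad campaign ID;
4. customer: anonymized advertiser ID;
5. brand: anonymized brand ID;
6. price: price of the item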
```
adgroup_id,cate_id,campaign_id,customer,brand,price
63133,6406,83237,1,95471,170.0
```

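- user_profile (user basic information table): covers the basic information of all users in raw_sample
1. userid: anonymized user ID;
2. cms_segid: micro-group ID;
3. cms_group_id: cms_group_id;
4. final_gender_code: gender, 1: male, 2: female;
5. age_level: age level; 1234
6. pvalue_level: consumption level, 1: low, 2: mid, 3: high;
7. shopping_level: shopping depth, 1: light user, 2: moderate user, 3: heavy user
8. occupation: whether the user is a college student, 1: yes, 0: no
9. new_user_class_level: city tier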
```
userid,cms_segid,cms_group_id,final_gender_code,age_level,pvalue_level,shopping_level,occupation,new_user_class_level
234,0,5,2,5,,3,0,3
```

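- behavior_log (user behavior log): covers the shopping behavior over 22 days of all users in raw_sample
1. user: anonymized user ID;
2. time_stamp: timestamp;
3. btag: behavior type, one of four: (pv: browse), (cart: add to cart), (fav: favorite), (buy: purchase)
4. cate: anonymized product category ID;
5. brand: anonymized brand ID;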
```
user,time_stamp,btag,cate,brand
558157,1493741625,pv,6250,91286
```

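## Preprocessed dataset
The four files of the raw dataset are merged into a single file, forming a dataset that the reader can read directly.
The dataset has 267 columns in total. raw_sample.csv from the raw dataset is split by timestamp: the first 7 days (20170506-20170512) are used as training samples and the 8th day (20170513) as test samples. Columns [0:150] hold the 50 history records looked up for each row of raw_sample.csv, columns [150:200] are the mask of [50:100], columns [200:250] are the mask of [100:150], columns [250:259] are the features from user_profile.csv, columns [259:265] are the features from ad_feature.csv, and the last column [266] indicates whether the user clicked.
The detailed [data processing](https://aistudio.baidu.com/aistudio/projectdetail/1805731) steps can be viewed here.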
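
The column layout and the day-based split described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `LAYOUT`, `split_row`, and `split_of` are hypothetical, the ranges are copied from the README as half-open 0-based slices, and timestamps are assumed to be Unix seconds with days counted in UTC+8. Column 265 is not described in the README, so the sketch leaves it unassigned.

```python
from datetime import datetime, timedelta, timezone

NUM_COLUMNS = 267
LABEL_INDEX = 266  # last column: whether the user clicked

# Column ranges as stated in the README (half-open, 0-based).
LAYOUT = {
    "history": (0, 150),
    "mask_a": (150, 200),  # described as the mask of [50:100]
    "mask_b": (200, 250),  # described as the mask of [100:150]
    "user_profile": (250, 259),
    "ad_feature": (259, 265),
}


def split_row(values):
    """Cut one preprocessed row into named groups (hypothetical helper)."""
    if len(values) != NUM_COLUMNS:
        raise ValueError("expected %d columns, got %d" % (NUM_COLUMNS, len(values)))
    parts = {name: values[lo:hi] for name, (lo, hi) in LAYOUT.items()}
    parts["label"] = values[LABEL_INDEX]
    return parts


# Assumption: days are counted in UTC+8; the README does not state a timezone.
CST = timezone(timedelta(hours=8))


def split_of(time_stamp):
    """Return 'train', 'test', or None for a raw_sample timestamp."""
    day = datetime.fromtimestamp(time_stamp, CST).strftime("%Y%m%d")
    if "20170506" <= day <= "20170512":
        return "train"
    if day == "20170513":
        return "test"
    return None


parts = split_row(list(range(NUM_COLUMNS)))
print(len(parts["history"]), parts["label"], split_of(1494137644))
```

Comparing `YYYYMMDD` strings works here because the format sorts lexicographically in date order.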
7 changes: 7 additions & 0 deletions datasets/Ali_Display_Ad_Click/run.sh
@@ -0,0 +1,7 @@
mkdir big_train
mkdir big_test
wget https://paddlerec.bj.bcebos.com/datasets/dmr/dataset_full.zip
unzip dataset_full.zip
mv work/train_sorted.csv big_train/
mv work/test.csv big_test/
rm -rf work
1 change: 1 addition & 0 deletions datasets/readme.md
@@ -37,3 +37,4 @@ sh data_process.sh
|[Netflix](https://paddlerec.bj.bcebos.com/datasets/Netflix/Netflix.zip)|The official dataset used in the Netflix Prize competition.|[Kaggle](https://www.kaggle.com/netflix-inc/netflix-prize-data)|
|[FourSquare](https://paddlerec.bj.bcebos.com/datasets/FourSquare/FourSquare.zip)|This dataset contains check-ins collected over about 10 months in New York and Tokyo. Each check-in is associated with its timestamp, GPS coordinates, and semantic meaning.|[Kaggle](https://www.kaggle.com/chetanism/foursquare-nyc-and-tokyo-checkin-dataset)|
|[AmazonBook](https://paddlerec.bj.bcebos.com/datasets/AmazonBook/AmazonBook.tar.gz)|The AmazonBook dataset as processed by the paper's original authors |[Controllable Multi-Interest Framework for Recommendation](https://arxiv.org/abs/2005.09347)|
|[Ali_Display_Ad_Click](https://paddlerec.bj.bcebos.com/datasets/dmr/dataset_full.zip)|The preprocessed Alimama dataset |[Deep Match to Rank Model for Personalized Click-Through Rate Prediction](https://github.com/lvze92/DMR)|
Binary file added doc/imgs/dmr.png
13 changes: 13 additions & 0 deletions models/rank/dmr/__init__.py
@@ -0,0 +1,13 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
36 changes: 36 additions & 0 deletions models/rank/dmr/alimama_reader.py
@@ -0,0 +1,36 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import print_function
import numpy as np

from paddle.io import IterableDataset


class RecDataset(IterableDataset):
    def __init__(self, file_list, config):
        super(RecDataset, self).__init__()
        self.file_list = file_list

    def __iter__(self):
        for file in self.file_list:
            with open(file, "r") as rf:
                for l in rf:
                    l = l.strip().split(",")
                    l = [
                        '0' if i == '' or i.upper() == 'NULL' else i for i in l
                    ]  # handle missing values
                    output_list = []
                    output_list.append(np.array(l).astype('float32'))
                    yield output_list
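
The missing-value rule in the reader above (empty fields and `NULL`, in any letter case, become `0`) can be tried without Paddle or NumPy. The sketch below is a dependency-free approximation rather than the reader itself: `parse_line` is a hypothetical name, and it returns plain Python floats instead of a `float32` array wrapped in a list.

```python
def parse_line(line):
    """Mimic the reader's per-line cleaning with plain Python floats (illustrative)."""
    fields = line.strip().split(",")
    # Same rule as the reader: empty strings and NULL (any case) become "0".
    cleaned = ["0" if f == "" or f.upper() == "NULL" else f for f in fields]
    return [float(f) for f in cleaned]


# The user_profile sample row has an empty pvalue_level field (6th column).
print(parse_line("234,0,5,2,5,,3,0,3\n"))
# prints [234.0, 0.0, 5.0, 2.0, 5.0, 0.0, 3.0, 0.0, 3.0]
```

Mapping missing values to `0` means a real `0` and a missing value are indistinguishable downstream, which is worth keeping in mind when `0` is also a valid ID.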
66 changes: 66 additions & 0 deletions models/rank/dmr/config.yaml
@@ -0,0 +1,66 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# global settings

runner:
  train_data_dir: "data/sample_data"
  train_reader_path: "alimama_reader" # importlib format
  use_gpu: False
  use_auc: True
  train_batch_size: 100
  epochs: 1
  print_interval: 1
  # model_init_path: "output_model_dmr/0" # init model
  model_save_path: "output_model_dmr"
  test_data_dir: "data/sample_data"
  infer_reader_path: "alimama_reader" # importlib format
  infer_batch_size: 256
  infer_load_path: "output_model_dmr"
  infer_start_epoch: 0
  infer_end_epoch: 1

# hyper parameters of user-defined network
hyper_parameters:
  # optimizer config
  optimizer:
    class: Adam
    learning_rate: 0.008
    strategy: async
  # user-defined <key, value> pairs
  # user feature size
  user_size: 1141730
  cms_segid_size: 97
  cms_group_id_size: 13
  final_gender_code_size: 3
  age_level_size: 7
  pvalue_level_size: 4
  shopping_level_size: 4
  occupation_size: 3
  new_user_class_level_size: 5

  # item feature size
  adgroup_id_size: 846812
  cate_size: 12978
  campaign_id_size: 423437
  customer_size: 255876
  brand_size: 461529

  # context feature size
  btag_size: 5
  pid_size: 2

  # embedding size
  main_embedding_size: 32
  other_embedding_size: 8
65 changes: 65 additions & 0 deletions models/rank/dmr/config_bigdata.yaml
@@ -0,0 +1,65 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# global settings

runner:
  train_data_dir: "../../../datasets/Ali_Display_Ad_Click/big_train"
  train_reader_path: "alimama_reader" # importlib format
  use_gpu: True
  use_auc: True
  train_batch_size: 5120
  epochs: 1
  print_interval: 100

  model_save_path: "output_model_all_dmr"
  test_data_dir: "../../../datasets/Ali_Display_Ad_Click/big_test"
  infer_reader_path: "alimama_reader" # importlib format
  infer_batch_size: 5120
  infer_load_path: "output_model_all_dmr"
  infer_start_epoch: 0
  infer_end_epoch: 1

# hyper parameters of user-defined network
hyper_parameters:
  # optimizer config
  optimizer:
    class: Adam
    learning_rate: 0.008
  # user-defined <key, value> pairs
  # user feature size
  user_size: 1141730
  cms_segid_size: 97
  cms_group_id_size: 13
  final_gender_code_size: 3
  age_level_size: 7
  pvalue_level_size: 4
  shopping_level_size: 4
  occupation_size: 3
  new_user_class_level_size: 5

  # item feature size
  adgroup_id_size: 846812
  cate_size: 12978
  campaign_id_size: 423437
  customer_size: 255876
  brand_size: 461529

  # context feature size
  btag_size: 5
  pid_size: 2

  # embedding size
  main_embedding_size: 32
  other_embedding_size: 8