PaddlePaddle
diff --git a/datasets/Ali_Display_Ad_Click/get_data.sh b/datasets/Ali_Display_Ad_Click/get_data.sh
Lines changed: 10 additions & 0 deletions
diff --git a/datasets/Ali_Display_Ad_Click/readme.md b/datasets/Ali_Display_Ad_Click/readme.md
Lines changed: 59 additions & 0 deletions
diff --git a/datasets/Ali_Display_Ad_Click/run.sh b/datasets/Ali_Display_Ad_Click/run.sh
Lines changed: 7 additions & 0 deletions
diff --git a/datasets/readme.md b/datasets/readme.md
Lines changed: 1 addition & 0 deletions
diff --git a/doc/imgs/dmr.png b/doc/imgs/dmr.png
264 KB
diff --git a/models/rank/dmr/__init__.py b/models/rank/dmr/__init__.py
Lines changed: 13 additions & 0 deletions
diff --git a/models/rank/dmr/alimama_reader.py b/models/rank/dmr/alimama_reader.py
Lines changed: 36 additions & 0 deletions
diff --git a/models/rank/dmr/config.yaml b/models/rank/dmr/config.yaml
Lines changed: 66 additions & 0 deletions
diff --git a/models/rank/dmr/config_bigdata.yaml b/models/rank/dmr/config_bigdata.yaml
Lines changed: 65 additions & 0 deletions
@@ -0,0 +1,10 @@
+mkdir -p raw_data
+cd raw_data
+wget https://paddlerec.bj.bcebos.com/datasets/dmr/user_profile.csv.tar.gz
+tar -zxvf user_profile.csv.tar.gz
+wget https://paddlerec.bj.bcebos.com/datasets/dmr/raw_sample.csv.tar.gz
+tar -zxvf raw_sample.csv.tar.gz
+wget https://paddlerec.bj.bcebos.com/datasets/dmr/behavior_log.csv.tar.gz
+tar -zxvf behavior_log.csv.tar.gz
+wget https://paddlerec.bj.bcebos.com/datasets/dmr/ad_feature.csv.tar.gz
+tar -zxvf ad_feature.csv.tar.gz
@@ -0,0 +1,59 @@
+# Ali_Display_Ad_Click dataset
+Ali_Display_Ad_Click is a click-through rate (CTR) prediction dataset for Taobao display advertising, provided by Alibaba.
+
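+## Raw dataset
+- Raw sample skeleton raw_sample: ad impression/click logs (26 million records) of 1.14 million users randomly sampled from the Taobao website over 8 days
+1. user: anonymized user ID;
+2. adgroup_id: anonymized ad unit ID;
+3. time_stamp: timestamp;
+4. pid: ad slot (resource position);
+5. nonclk: 1 means not clicked; 0 means clicked;
+6. clk: 0 means not clicked; 1 means clicked;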
+
+```
+user,time_stamp,adgroup_id,pid,nonclk,clk
+581738,1494137644,1,430548_1007,1,0
+```
+
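+- Ad feature table ad_feature: basic information for every ad that appears in raw_sample
+1. adgroup_id: anonymized ad ID;
+2. cate_id: anonymized item category ID;
+3. campaign_id: anonymized ad campaign ID;
+4. customer: anonymized advertiser ID;
+5. brand: anonymized brand ID;
+6. price: price of the item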
+```
+adgroup_id,cate_id,campaign_id,customer,brand,price
+63133,6406,83237,1,95471,170.0
+```
+
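+- User profile table user_profile: basic information for every user that appears in raw_sample
+1. userid: anonymized user ID;
+2. cms_segid: micro-group ID;
+3. cms_group_id: cms_group_id;
+4. final_gender_code: gender, 1: male, 2: female;
+5. age_level: age bracket;
+6. pvalue_level: consumption level, 1: low, 2: medium, 3: high;
+7. shopping_level: shopping depth, 1: light user, 2: moderate user, 3: heavy user
+8. occupation: whether the user is a college student, 1: yes, 0: no
+9. new_user_class_level: city tier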
+```
+userid,cms_segid,cms_group_id,final_gender_code,age_level,pvalue_level,shopping_level,occupation,new_user_class_level 
+234,0,5,2,5,,3,0,3
+```
+
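+- User behavior log behavior_log: shopping behavior over 22 days for every user that appears in raw_sample
+1. user: anonymized user ID;
+2. time_stamp: timestamp;
+3. btag: behavior type, one of four: (pv: browse), (cart: add to cart), (fav: favorite), (buy: purchase)
+4. cate: anonymized item category ID;
+5. brand: anonymized brand ID;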
+```
+user,time_stamp,btag,cate,brand
+558157,1493741625,pv,6250,91286
+```
+
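+## Preprocessed dataset
+The four files of the raw dataset are merged into a single file, forming a dataset that the reader can load directly.
+The dataset has 267 columns in total. raw_sample.csv is split by timestamp: the first 7 days (20170506-20170512) are the training samples and the 8th day (20170513) is the test sample. Columns [0:150] hold the 50 history records looked up for each row of raw_sample.csv, columns [150:200] are the mask for [50:100], columns [200:250] are the mask for [100:150], columns [250:259] are the features from user_profile.csv, columns [259:265] are the features from ad_feature.csv, and the last column [266] indicates whether the user clicked.
+See [data processing](https://aistudio.baidu.com/aistudio/projectdetail/1805731) for the detailed procedure.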
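Reviewer note: the column layout described in the readme hunk above can be sketched as slices over one parsed row. This is only an illustration, not code from this PR. The names `COLUMN_SLICES` and `split_row` are hypothetical, and plain Python lists stand in for the arrays the reader produces. The ranges are copied from the readme text, which does not describe column 265, so the sketch keeps it under a placeholder name rather than guessing its meaning.

```python
# Column ranges as stated in the readme (half-open, Python-style slices).
# "undocumented" is a placeholder: the readme does not describe column 265.
COLUMN_SLICES = {
    "history": slice(0, 150),         # 50 history records per raw_sample row
    "mask_50_100": slice(150, 200),   # mask for columns [50:100]
    "mask_100_150": slice(200, 250),  # mask for columns [100:150]
    "user_profile": slice(250, 259),  # 9 features from user_profile.csv
    "ad_feature": slice(259, 265),    # 6 features from ad_feature.csv
    "undocumented": slice(265, 266),  # not described in the readme
    "label": slice(266, 267),         # whether the user clicked
}

EXPECTED_COLUMNS = 267


def split_row(row):
    """Split one 267-column row into named groups (hypothetical helper)."""
    if len(row) != EXPECTED_COLUMNS:
        raise ValueError(
            "expected %d columns, got %d" % (EXPECTED_COLUMNS, len(row)))
    return {name: row[s] for name, s in COLUMN_SLICES.items()}


# Use the column index as the value so each group is easy to inspect.
example = split_row(list(range(EXPECTED_COLUMNS)))
print(len(example["user_profile"]), len(example["ad_feature"]), example["label"])
```

A check like the length test in `split_row` is cheap to run once on the downloaded CSV before training, since a row with a stray comma would otherwise shift every feature silently.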
@@ -0,0 +1,7 @@
+mkdir -p big_train
+mkdir -p big_test
+wget https://paddlerec.bj.bcebos.com/datasets/dmr/dataset_full.zip
+unzip dataset_full.zip
+mv work/train_sorted.csv big_train/
+mv work/test.csv big_test/
+rm -rf work
@@ -37,3 +37,4 @@ sh data_process.sh
  |[Netflix](https://paddlerec.bj.bcebos.com/datasets/Netflix/Netflix.zip)|The official dataset used in the Netflix Prize competition.|[Kaggle](https://www.kaggle.com/netflix-inc/netflix-prize-data)|
  |[FourSquare](https://paddlerec.bj.bcebos.com/datasets/FourSquare/FourSquare.zip)|Check-ins collected over about 10 months in New York and Tokyo. Each check-in comes with its timestamp, GPS coordinates, and semantic information.|[Kaggle](https://www.kaggle.com/chetanism/foursquare-nyc-and-tokyo-checkin-dataset)|
  |[AmazonBook](https://paddlerec.bj.bcebos.com/datasets/AmazonBook/AmazonBook.tar.gz)|The AmazonBook dataset as preprocessed by the paper's authors |[Controllable Multi-Interest Framework for Recommendation](https://arxiv.org/abs/2005.09347)|
+ |[Ali_Display_Ad_Click](https://paddlerec.bj.bcebos.com/datasets/dmr/dataset_full.zip)|Preprocessed Alimama dataset |[Deep Match to Rank Model for Personalized Click-Through Rate Prediction](https://github.com/lvze92/DMR)|
@@ -0,0 +1,13 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
@@ -0,0 +1,36 @@
+#   Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+from __future__ import print_function
+import numpy as np
+
+from paddle.io import IterableDataset
+
+
+class RecDataset(IterableDataset):
+    def __init__(self, file_list, config):
+        super(RecDataset, self).__init__()
+        self.file_list = file_list
+
+    def __iter__(self):
+        for file in self.file_list:
+            with open(file, "r") as rf:
+                for line in rf:
+                    fields = line.strip().split(",")
+                    fields = [  # map empty or NULL fields to '0'
+                        '0' if f == '' or f.upper() == 'NULL' else f
+                        for f in fields
+                    ]
+                    # one float32 feature vector per CSV row
+                    yield [np.array(fields).astype('float32')]
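Reviewer note: the per-line handling in the reader above (split on commas, replace empty or `NULL` fields with `0`, convert to numbers) can be tried without Paddle installed. The sketch below is a standalone approximation, not the PR's code: `parse_line` is a hypothetical helper, and it returns plain Python floats where the reader returns a float32 NumPy array. The input is the user_profile sample row from the readme, whose `pvalue_level` field is empty.

```python
def parse_line(line):
    """Mirror the reader's per-line cleaning (hypothetical helper).

    Empty fields and the literal NULL (any letter case) become 0.0.
    """
    fields = line.strip().split(",")
    cleaned = ["0" if f == "" or f.upper() == "NULL" else f for f in fields]
    return [float(f) for f in cleaned]


# Sample row from the readme: the sixth field (pvalue_level) is missing.
row = parse_line("234,0,5,2,5,,3,0,3\n")
print(row)  # [234.0, 0.0, 5.0, 2.0, 5.0, 0.0, 3.0, 0.0, 3.0]

# The literal NULL is treated the same way, regardless of case.
print(parse_line("1,null,NULL,2"))  # [1.0, 0.0, 0.0, 2.0]
```

Mapping missing values to 0 means index 0 of each affected feature doubles as the "unknown" bucket, so any embedding table sized from the config has to leave room for it.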
@@ -0,0 +1,66 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# global settings
+
+runner:
+  train_data_dir: "data/sample_data"
+  train_reader_path: "alimama_reader" # importlib format
+  use_gpu: False
+  use_auc: True
+  train_batch_size: 100
+  epochs: 1
+  print_interval: 1
+  # model_init_path: "output_model_dmr/0" # init model
+  model_save_path: "output_model_dmr"
+  test_data_dir: "data/sample_data"
+  infer_reader_path: "alimama_reader" # importlib format
+  infer_batch_size: 256
+  infer_load_path: "output_model_dmr"
+  infer_start_epoch: 0
+  infer_end_epoch: 1
+
+# hyper parameters of user-defined network
+hyper_parameters:
+  # optimizer config
+  optimizer:
+    class: Adam
+    learning_rate: 0.008
+    strategy: async
+  # user-defined <key, value> pairs
+  # user feature size
+  user_size: 1141730
+  cms_segid_size: 97
+  cms_group_id_size: 13
+  final_gender_code_size: 3
+  age_level_size: 7
+  pvalue_level_size: 4
+  shopping_level_size: 4
+  occupation_size: 3
+  new_user_class_level_size: 5
+
+  # item feature size
+  adgroup_id_size: 846812
+  cate_size: 12978
+  campaign_id_size: 423437
+  customer_size: 255876
+  brand_size: 461529
+
+  # context feature size
+  btag_size: 5
+  pid_size: 2
+
+  # embedding size
+  main_embedding_size: 32
+  other_embedding_size: 8
@@ -0,0 +1,65 @@
+# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# global settings
+
+runner:
+  train_data_dir: "../../../datasets/Ali_Display_Ad_Click/big_train"
+  train_reader_path: "alimama_reader" # importlib format
+  use_gpu: True
+  use_auc: True
+  train_batch_size: 5120
+  epochs: 1
+  print_interval: 100
+
+  model_save_path: "output_model_all_dmr"
+  test_data_dir: "../../../datasets/Ali_Display_Ad_Click/big_test"
+  infer_reader_path: "alimama_reader" # importlib format
+  infer_batch_size: 5120
+  infer_load_path: "output_model_all_dmr"
+  infer_start_epoch: 0
+  infer_end_epoch: 1
+
+# hyper parameters of user-defined network
+hyper_parameters:
+  # optimizer config
+  optimizer:
+    class: Adam
+    learning_rate: 0.008
+  # user-defined <key, value> pairs
+  # user feature size
+  user_size: 1141730
+  cms_segid_size: 97
+  cms_group_id_size: 13
+  final_gender_code_size: 3
+  age_level_size: 7
+  pvalue_level_size: 4
+  shopping_level_size: 4
+  occupation_size: 3
+  new_user_class_level_size: 5
+
+  # item feature size
+  adgroup_id_size: 846812
+  cate_size: 12978
+  campaign_id_size: 423437
+  customer_size: 255876
+  brand_size: 461529
+
+  # context feature size
+  btag_size: 5
+  pid_size: 2
+
+  # embedding size
+  main_embedding_size: 32
+  other_embedding_size: 8