1 parent c35e90b, commit a0990c0
Showing 14 changed files with 1,597 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
mkdir raw_data | ||
cd raw_data | ||
wget https://paddlerec.bj.bcebos.com/datasets/dmr/user_profile.csv.tar.gz | ||
tar -zxvf user_profile.csv.tar.gz | ||
wget https://paddlerec.bj.bcebos.com/datasets/dmr/raw_sample.csv.tar.gz | ||
tar -zxvf raw_sample.csv.tar.gz | ||
wget https://paddlerec.bj.bcebos.com/datasets/dmr/behavior_log.csv.tar.gz | ||
tar -zxvf behavior_log.csv.tar.gz | ||
wget https://paddlerec.bj.bcebos.com/datasets/dmr/ad_feature.csv.tar.gz | ||
tar -zxvf ad_feature.csv.tar.gz |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,59 @@ | ||
# Ali_Display_Ad_Click Dataset | ||
Ali_Display_Ad_Click is a Taobao display-advertising click-through rate (CTR) prediction dataset provided by Alibaba. | ||
|
||
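## Raw Dataset | ||
- Raw sample skeleton raw_sample: ad display/click logs (26 million records) covering 8 days for 1.14 million users randomly sampled from the Taobao website; these form the raw sample skeleton | ||
1. user: anonymized user ID; | ||
2. time_stamp: timestamp; | ||
3. adgroup_id: anonymized ad unit ID; | ||
4. pid: ad slot (resource position); | ||
5. nonclk: 1 means not clicked, 0 means clicked; | ||
6. clk: 0 means not clicked, 1 means clicked; | ||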
|
||
``` | ||
user,time_stamp,adgroup_id,pid,nonclk,clk | ||
581738,1494137644,1,430548_1007,1,0 | ||
``` | ||
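The field list above can be exercised with a minimal parsing sketch. The function name `parse_raw_sample` and the choice to keep `pid` as a string are assumptions made for illustration; only the header and the sample row come from the README.

```python
import csv
import io

# Header and sample row copied from the README above.
RAW_SAMPLE = (
    "user,time_stamp,adgroup_id,pid,nonclk,clk\n"
    "581738,1494137644,1,430548_1007,1,0\n"
)


def parse_raw_sample(text):
    """Yield one dict per raw_sample row, with numeric fields cast to int."""
    for row in csv.DictReader(io.StringIO(text)):
        record = {
            "user": int(row["user"]),
            "time_stamp": int(row["time_stamp"]),
            "adgroup_id": int(row["adgroup_id"]),
            "pid": row["pid"],  # kept as a string, e.g. "430548_1007"
            "nonclk": int(row["nonclk"]),
            "clk": int(row["clk"]),
        }
        # nonclk and clk are documented as complements of each other.
        if record["nonclk"] + record["clk"] != 1:
            raise ValueError("nonclk and clk disagree: %r" % (row,))
        yield record


rows = list(parse_raw_sample(RAW_SAMPLE))
print(rows[0]["clk"])
```

Since `nonclk` and `clk` carry the same information, a downstream pipeline would likely need only one of them as the label.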
|
||
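- Ad basic information table ad_feature: covers the basic information of every ad in raw_sample | ||
1. adgroup_id: anonymized ad ID; | ||
2. cate_id: anonymized product category ID; | ||
3. campaign_id: anonymized ad campaign ID; | ||
4. customer: anonymized advertiser ID; | ||
5. brand: anonymized brand ID; | ||
6. price: price of the item | ||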
``` | ||
adgroup_id,cate_id,campaign_id,customer,brand,price | ||
63133,6406,83237,1,95471,170.0 | ||
``` | ||
|
||
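- User basic information table user_profile: covers the basic information of every user in raw_sample | ||
1. userid: anonymized user ID; | ||
2. cms_segid: micro-group ID; | ||
3. cms_group_id: CMS group ID; | ||
4. final_gender_code: gender, 1: male, 2: female; | ||
5. age_level: age level, encoded as an integer (e.g. 1, 2, 3, 4); | ||
6. pvalue_level: consumption level, 1: low, 2: mid, 3: high; | ||
7. shopping_level: shopping depth, 1: shallow user, 2: moderate user, 3: deep user | ||
8. occupation: whether the user is a college student, 1: yes, 0: no | ||
9. new_user_class_level: city tier | ||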
``` | ||
userid,cms_segid,cms_group_id,final_gender_code,age_level,pvalue_level,shopping_level,occupation,new_user_class_level | ||
234,0,5,2,5,,3,0,3 | ||
``` | ||
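The sample row above has an empty `pvalue_level` field, so any parser has to decide what to do with missing values. The sketch below fills them with 0, which matches the default used by the reader in this commit; the function name `parse_user_profile` and the `missing` parameter are hypothetical, and treating 0 as "unknown" assumes 0 is not otherwise a meaningful category for that field.

```python
import csv
import io

# Header and sample row copied from the README above.
USER_PROFILE = (
    "userid,cms_segid,cms_group_id,final_gender_code,age_level,"
    "pvalue_level,shopping_level,occupation,new_user_class_level\n"
    "234,0,5,2,5,,3,0,3\n"
)


def parse_user_profile(text, missing=0):
    """Yield one dict per user, replacing empty fields with `missing`."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {
            key: int(value) if value != "" else missing
            for key, value in row.items()
        }


user = next(parse_user_profile(USER_PROFILE))
print(user["pvalue_level"], user["shopping_level"])
```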
|
||
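- User behavior log behavior_log: covers 22 days of shopping behavior for every user in raw_sample | ||
1. user: anonymized user ID; | ||
2. time_stamp: timestamp; | ||
3. btag: behavior type, one of four values: pv (browse), cart (add to cart), fav (favorite), buy (purchase) | ||
4. cate: anonymized product category ID; | ||
5. brand: anonymized brand ID; | ||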
``` | ||
user,time_stamp,btag,cate,brand | ||
558157,1493741625,pv,6250,91286 | ||
``` | ||
|
||
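## Preprocessed Dataset | ||
The four files of the raw dataset are merged into a single file, forming a dataset that the reader can consume directly. | ||
The dataset has 267 columns in total. raw_sample.csv from the raw dataset is split by timestamp: the first 7 days (20170506-20170512) are used as training samples and the 8th day (20170513) as test samples. Columns [0:150] hold the 50 history records looked up for each row of raw_sample.csv, columns [150:200] are the mask for [50:100], columns [200:250] are the mask for [100:150], columns [250:259] are the features from user_profile.csv, columns [259:265] are the features from ad_feature.csv, and the last column [266] indicates whether the user clicked. | ||
See the detailed [data processing](https://aistudio.baidu.com/aistudio/projectdetail/1805731) walkthrough for the full procedure. |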
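The column ranges listed above can be written down as NumPy slices, which makes the layout easier to check. This is only a sketch: the group names (`history`, `mask_a`, and so on) are made up for illustration, and column 265 is left unnamed because the README does not describe it.

```python
import numpy as np

# Column ranges as stated in the README (half-open, NumPy style).
# Column 265 is not described there, so it is left unnamed here.
COLUMN_SLICES = {
    "history": slice(0, 150),
    "mask_a": slice(150, 200),  # README: mask for columns [50:100]
    "mask_b": slice(200, 250),  # README: mask for columns [100:150]
    "user_profile": slice(250, 259),
    "ad_feature": slice(259, 265),
    "label": slice(266, 267),
}
NUM_COLUMNS = 267


def split_columns(batch):
    """Split a (batch_size, 267) array into the named column groups."""
    if batch.ndim != 2 or batch.shape[1] != NUM_COLUMNS:
        raise ValueError("expected shape (N, %d), got %r" % (NUM_COLUMNS, batch.shape))
    return {name: batch[:, sl] for name, sl in COLUMN_SLICES.items()}


batch = np.zeros((4, NUM_COLUMNS), dtype="float32")
parts = split_columns(batch)
print({name: part.shape for name, part in parts.items()})
```

The nine user_profile columns and six ad_feature columns match the field counts of the two raw tables described earlier.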
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
mkdir big_train | ||
mkdir big_test | ||
wget https://paddlerec.bj.bcebos.com/datasets/dmr/dataset_full.zip | ||
unzip dataset_full.zip | ||
mv work/train_sorted.csv big_train/ | ||
mv work/test.csv big_test/ | ||
rm -rf work |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,36 @@ | ||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
from __future__ import print_function | ||
import numpy as np | ||
|
||
from paddle.io import IterableDataset | ||
|
||
|
||
class RecDataset(IterableDataset): | ||
    def __init__(self, file_list, config): | ||
        super(RecDataset, self).__init__() | ||
        # config is accepted but not used by this reader | ||
        self.file_list = file_list | ||
|
||
    def __iter__(self): | ||
        for file in self.file_list: | ||
            with open(file, "r") as rf: | ||
                for line in rf: | ||
                    fields = line.strip().split(",") | ||
                    # handle missing values: empty or NULL fields become '0' | ||
                    fields = [ | ||
                        '0' if i == '' or i.upper() == 'NULL' else i | ||
                        for i in fields | ||
                    ] | ||
                    # yield one float32 vector holding every column of the row | ||
                    yield [np.array(fields).astype('float32')] | ||
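The per-line logic of the reader can be tried without installing Paddle. The sketch below extracts it into a standalone function; `parse_line` is a hypothetical name and the input row is made up to show both kinds of missing value.

```python
import numpy as np


def parse_line(line):
    """Mirror the reader's per-line logic: split, fill missing, cast."""
    fields = line.strip().split(",")
    fields = ["0" if f == "" or f.upper() == "NULL" else f for f in fields]
    return np.array(fields).astype("float32")


# Made-up row with one empty field and one NULL field.
row = parse_line("234,0,5,2,5,,3,0,NULL\n")
print(row.dtype, row.tolist())
```

One thing to keep in mind with this design: float32 represents integers exactly only up to 2^24 = 16,777,216. The largest vocabulary size in the configs of this commit (`user_size: 1141730`) is below that limit, but larger ids or raw timestamps would lose precision if stored this way.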
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
# global settings | ||
|
||
runner: | ||
  train_data_dir: "data/sample_data" | ||
  train_reader_path: "alimama_reader" # importlib format | ||
  use_gpu: False | ||
  use_auc: True | ||
  train_batch_size: 100 | ||
  epochs: 1 | ||
  print_interval: 1 | ||
  # model_init_path: "output_model_dmr/0" # init model | ||
  model_save_path: "output_model_dmr" | ||
  test_data_dir: "data/sample_data" | ||
  infer_reader_path: "alimama_reader" # importlib format | ||
  infer_batch_size: 256 | ||
  infer_load_path: "output_model_dmr" | ||
  infer_start_epoch: 0 | ||
  infer_end_epoch: 1 | ||
|
||
# hyper parameters of user-defined network | ||
hyper_parameters: | ||
  # optimizer config | ||
  optimizer: | ||
    class: Adam | ||
    learning_rate: 0.008 | ||
    strategy: async | ||
  # user-defined <key, value> pairs | ||
  # user feature size | ||
  user_size: 1141730 | ||
  cms_segid_size: 97 | ||
  cms_group_id_size: 13 | ||
  final_gender_code_size: 3 | ||
  age_level_size: 7 | ||
  pvalue_level_size: 4 | ||
  shopping_level_size: 4 | ||
  occupation_size: 3 | ||
  new_user_class_level_size: 5 | ||
|
||
  # item feature size | ||
  adgroup_id_size: 846812 | ||
  cate_size: 12978 | ||
  campaign_id_size: 423437 | ||
  customer_size: 255876 | ||
  brand_size: 461529 | ||
|
||
  # context feature size | ||
  btag_size: 5 | ||
  pid_size: 2 | ||
|
||
  # embedding size | ||
  main_embedding_size: 32 | ||
  other_embedding_size: 8 |
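The `# importlib format` comment on `train_reader_path` suggests the value is a module name rather than a file path. The sketch below shows how such a name could be resolved to a dataset class with the standard library. It is an assumption about how a runner might use the setting, not the framework's actual loader; `load_reader_class`, `search_dir`, and the stand-in module `demo_reader` are all hypothetical.

```python
import importlib
import os
import sys
import tempfile


def load_reader_class(module_name, class_name="RecDataset", search_dir=None):
    """Import `module_name` and return the attribute named `class_name`."""
    if search_dir is not None and search_dir not in sys.path:
        sys.path.insert(0, search_dir)
    # Make sure freshly written modules are visible to the import system.
    importlib.invalidate_caches()
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Self-contained demo: write a stand-in module to a temporary directory.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "demo_reader.py"), "w") as f:
    f.write("class RecDataset(object):\n    pass\n")

reader_cls = load_reader_class("demo_reader", search_dir=demo_dir)
print(reader_cls.__name__)
```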
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,65 @@ | ||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
# global settings | ||
|
||
runner: | ||
  train_data_dir: "../../../datasets/Ali_Display_Ad_Click/big_train" | ||
  train_reader_path: "alimama_reader" # importlib format | ||
  use_gpu: True | ||
  use_auc: True | ||
  train_batch_size: 5120 | ||
  epochs: 1 | ||
  print_interval: 100 | ||
|
||
  model_save_path: "output_model_all_dmr" | ||
  test_data_dir: "../../../datasets/Ali_Display_Ad_Click/big_test" | ||
  infer_reader_path: "alimama_reader" # importlib format | ||
  infer_batch_size: 5120 | ||
  infer_load_path: "output_model_all_dmr" | ||
  infer_start_epoch: 0 | ||
  infer_end_epoch: 1 | ||
|
||
# hyper parameters of user-defined network | ||
hyper_parameters: | ||
  # optimizer config | ||
  optimizer: | ||
    class: Adam | ||
    learning_rate: 0.008 | ||
  # user-defined <key, value> pairs | ||
  # user feature size | ||
  user_size: 1141730 | ||
  cms_segid_size: 97 | ||
  cms_group_id_size: 13 | ||
  final_gender_code_size: 3 | ||
  age_level_size: 7 | ||
  pvalue_level_size: 4 | ||
  shopping_level_size: 4 | ||
  occupation_size: 3 | ||
  new_user_class_level_size: 5 | ||
|
||
  # item feature size | ||
  adgroup_id_size: 846812 | ||
  cate_size: 12978 | ||
  campaign_id_size: 423437 | ||
  customer_size: 255876 | ||
  brand_size: 461529 | ||
|
||
  # context feature size | ||
  btag_size: 5 | ||
  pid_size: 2 | ||
|
||
  # embedding size | ||
  main_embedding_size: 32 | ||
  other_embedding_size: 8 |
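The `infer_load_path`, `infer_start_epoch`, and `infer_end_epoch` settings together select which saved checkpoints are evaluated. The sketch below shows one plausible reading of them. It assumes that checkpoints live in one subdirectory per epoch (as the commented-out `model_init_path: "output_model_dmr/0"` in the sample config hints) and that the end epoch is exclusive; neither is confirmed by this commit, and `checkpoint_dirs` is a hypothetical helper.

```python
import os


def checkpoint_dirs(load_path, start_epoch, end_epoch):
    """List per-epoch checkpoint directories; end_epoch is exclusive (assumed)."""
    return [
        os.path.join(load_path, str(epoch))
        for epoch in range(start_epoch, end_epoch)
    ]


# Values taken from the config above.
dirs = checkpoint_dirs("output_model_all_dmr", 0, 1)
print(dirs)
```

Under these assumptions, the config above would evaluate only the checkpoint written after the single training epoch.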