update data desc

ChenglongChen · Dec 1, 2018 · 6a9a65a · 6a9a65a
1 parent 90a338c
commit 6a9a65a
Showing 9 changed files with 88 additions and 37 deletions.
diff --git a/.gitignore b/.gitignore
@@ -104,22 +104,4 @@ venv.bak/
 
 #
 .idea
-__pycache__
-src/inputs/pair_generator.py
-src/models/stacking_model.py
-src/models/calibration.py
-tmp.py
-*.ipynb*
-weights
-*.sh
-run*
-*.csv
-features.py
-gen_stacking.py
-generate_train_valid_split.py
-split.pkl
-data
-logs
-output
-sub
-summary
+__pycache__
diff --git a/DATA.md b/DATA.md
diff --git a/README.md b/README.md
@@ -11,9 +11,11 @@ Ongoing project for implementing various Deep Semantic Matching Models (DSMM). D
 
 ## Quickstart
 ### Data
-This project is developed with regard to the data format provided in the [第三届魔镜杯大赛](https://www.ppdai.ai/mirror/goToMirrorDetail?mirrorId=1). You should see the data format description there (or see `DATA.md`) and prepared data accordingly. If you want to run a quick demo, please download data there.
+This project is developed with regard to the data format provided in the [第三届魔镜杯大赛](https://www.ppdai.ai/mirror/goToMirrorDetail?mirrorId=1). 
 
-Your data should be placed in the `data` directory.
+See `data/DATA.md` for the data format description and prepare your data accordingly. Your data should be placed in the `data` directory.
+
+If you want to run a quick demo, you can download the data from the competition link above.
 
 ### Demo
 ```bash

diff --git a/data/DATA.md b/data/DATA.md
@@ -0,0 +1,59 @@
+# Data Format
+## char_embed.txt
+This file should contain the character embeddings.
+
+Each line should be `char_id embedding_vector`. For example,
+```text
+C1 0 0 0 0
+C2 0.1 0.5 0.4 0.2
+C3 0.8 0.2 0.9 1.0
+C4 0.14 0.15 0.64 0.12
+```
+
+## word_embed.txt
+This file should contain the word embeddings.
+
+Each line should be `word_id embedding_vector`. For example,
+```text
+W1 0 0 0 0
+W2 0.1 0.5 0.4 0.2
+W3 0.8 0.2 0.9 1.0
+W4 0.14 0.15 0.64 0.12
+```
+
+## question.csv
+This file should contain all the questions that appear in `train.csv` and `test.csv`.
+
+Each line should be `question_id,word_sequence_ids,char_sequence_ids`. For example,
+```text
+qid,words,chars
+Q1,W1 W2 W3,C31 C64 C45 C85
+Q2,W2 W9 W7 W10 W20,C39 C58 C3
+Q3,W23 W91 W7 W10 W290,C19 C81 C31
+Q4,W25 W9 W70 W101 W210,C92 C58 C33
+Q5,W22 W9 W7 W130 W20,C98 C85 C35
+Q6,W2 W19 W87,C39 C86 C34
+```
+
+## train.csv
+This file should contain the training question pairs.
+
+Each line should be `label,q1,q2`, where `q1` is the id of question 1 and `q2` is the id of question 2. `label=1` means the two questions have the same meaning; `label=0` means they have different meanings. For example,
+```text
+label,q1,q2
+1,Q1,Q2
+0,Q1,Q3
+0,Q2,Q4
+0,Q5,Q1
+1,Q2,Q6
+```
+
+## test.csv
+This file should contain the testing question pairs.
+
+Each line should be `q1,q2`, where `q1` is the id of question 1 and `q2` is the id of question 2. For example,
+```text
+q1,q2
+Q2,Q3
+Q6,Q5
+```
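The embedding format described in `data/DATA.md` (`id` followed by space-separated vector values) can be read with a few lines of Python. The following is a minimal sketch, not the project's own loader: the function name `parse_embeddings` and the equal-dimension check are assumptions added for illustration.

```python
from typing import Dict, List


def parse_embeddings(text: str) -> Dict[str, List[float]]:
    """Parse lines of the form `id v1 v2 ... vn` into {id: vector}.

    Hypothetical helper: it only assumes the whitespace-separated
    layout shown in data/DATA.md.
    """
    table: Dict[str, List[float]] = {}
    dim = None
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        token, *values = line.split()
        vector = [float(v) for v in values]
        if dim is None:
            dim = len(vector)  # the first row fixes the dimension
        elif len(vector) != dim:
            raise ValueError(
                f"line {line_no}: expected {dim} values, got {len(vector)}"
            )
        table[token] = vector
    return table


# Two rows copied from the char_embed.txt example.
SAMPLE = """\
C1 0 0 0 0
C2 0.1 0.5 0.4 0.2
"""
char_embed = parse_embeddings(SAMPLE)
```

The same function would apply to `word_embed.txt`, since both files share one layout; for a file on disk, pass the result of reading the file as text.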
diff --git a/data/char_embed.txt b/data/char_embed.txt
@@ -0,0 +1,4 @@
+C1 0 0 0 0
+C2 0.1 0.5 0.4 0.2
+C3 0.8 0.2 0.9 1.0
+C4 0.14 0.15 0.64 0.12
diff --git a/data/question.csv b/data/question.csv
@@ -0,0 +1,7 @@
+qid,words,chars
+Q1,W1 W2 W3,C31 C64 C45 C85
+Q2,W2 W9 W7 W10 W20,C39 C58 C3
+Q3,W23 W91 W7 W10 W290,C19 C81 C31
+Q4,W25 W9 W70 W101 W210,C92 C58 C33
+Q5,W22 W9 W7 W130 W20,C98 C85 C35
+Q6,W2 W19 W87,C39 C86 C34
diff --git a/data/test.csv b/data/test.csv
@@ -0,0 +1,3 @@
+q1,q2
+Q2,Q3
+Q6,Q5
diff --git a/data/train.csv b/data/train.csv
@@ -0,0 +1,6 @@
+label,q1,q2
+1,Q1,Q2
+0,Q1,Q3
+0,Q2,Q4
+0,Q5,Q1
+1,Q2,Q6
diff --git a/data/word_embed.txt b/data/word_embed.txt
@@ -0,0 +1,4 @@
+W1 0 0 0 0
+W2 0.1 0.5 0.4 0.2
+W3 0.8 0.2 0.9 1.0
+W4 0.14 0.15 0.64 0.12
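To illustrate how the sample files added in this commit relate to each other, here is a minimal sketch that resolves the question ids in `train.csv` against `question.csv`. It relies only on the column names shown in the samples (`qid`, `words`, `chars`, `label`, `q1`, `q2`); the helpers `read_questions` and `read_pairs` are hypothetical and are not taken from the repository.

```python
import csv
import io

# Shortened copies of the sample files, inlined so the sketch is self-contained.
QUESTION_CSV = """\
qid,words,chars
Q1,W1 W2 W3,C31 C64 C45 C85
Q2,W2 W9 W7 W10 W20,C39 C58 C3
Q3,W23 W91 W7 W10 W290,C19 C81 C31
"""
TRAIN_CSV = """\
label,q1,q2
1,Q1,Q2
0,Q1,Q3
"""


def read_questions(handle):
    """Map each question id to its word-id and char-id sequences."""
    questions = {}
    for row in csv.DictReader(handle):
        questions[row["qid"]] = {
            "words": row["words"].split(),
            "chars": row["chars"].split(),
        }
    return questions


def read_pairs(handle, questions):
    """Return (label, question1, question2) tuples.

    The label is None when the file has no `label` column, as in test.csv.
    """
    pairs = []
    for row in csv.DictReader(handle):
        for key in ("q1", "q2"):
            if row[key] not in questions:
                raise KeyError(f"{row[key]} is missing from question.csv")
        label = int(row["label"]) if "label" in row else None
        pairs.append((label, questions[row["q1"]], questions[row["q2"]]))
    return pairs


questions = read_questions(io.StringIO(QUESTION_CSV))
pairs = read_pairs(io.StringIO(TRAIN_CSV), questions)
```

Raising on an unknown id is a design choice of this sketch: it surfaces a `train.csv` or `test.csv` row that refers to a question absent from `question.csv` instead of silently dropping it.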