update data desc
ChenglongChen committed Dec 1, 2018
1 parent 90a338c commit 6a9a65a
Showing 9 changed files with 88 additions and 37 deletions.
20 changes: 1 addition & 19 deletions .gitignore
@@ -104,22 +104,4 @@ venv.bak/

#
.idea
-__pycache__
-src/inputs/pair_generator.py
-src/models/stacking_model.py
-src/models/calibration.py
-tmp.py
-*.ipynb*
-weights
-*.sh
-run*
-*.csv
-features.py
-gen_stacking.py
-generate_train_valid_split.py
-split.pkl
-data
-logs
-output
-sub
-summary
+__pycache__
16 changes: 0 additions & 16 deletions DATA.md

This file was deleted.

6 changes: 4 additions & 2 deletions README.md
@@ -11,9 +11,11 @@ Ongoing project for implementing various Deep Semantic Matching Models (DSMM). D

## Quickstart
### Data
-This project is developed with regard to the data format provided in the [第三届魔镜杯大赛](https://www.ppdai.ai/mirror/goToMirrorDetail?mirrorId=1). You should see the data format description there (or see `DATA.md`) and prepared data accordingly. If you want to run a quick demo, please download data there.
+This project is developed with regard to the data format provided in the [第三届魔镜杯大赛](https://www.ppdai.ai/mirror/goToMirrorDetail?mirrorId=1).

-Your data should be placed in the `data` directory.
+See `/data/DATA.md` for the data format description and prepare your data accordingly. Your data should be placed in the `data` directory.
+
+If you want to run a quick demo, you can download data from the above competition link.

### Demo
```bash
59 changes: 59 additions & 0 deletions data/DATA.md
@@ -0,0 +1,59 @@
# Data Format
## char_embed.txt
This file should contain the char embeddings.

Each line should be `char_id embedding_vector`. For example,
```text
C1 0 0 0 0
C2 0.1 0.5 0.4 0.2
C3 0.8 0.2 0.9 1.0
C4 0.14 0.15 0.64 0.12
```

## word_embed.txt
This file should contain the word embeddings.

Each line should be `word_id embedding_vector`. For example,
```text
W1 0 0 0 0
W2 0.1 0.5 0.4 0.2
W3 0.8 0.2 0.9 1.0
W4 0.14 0.15 0.64 0.12
```
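
Since `char_embed.txt` and `word_embed.txt` share the same `id embedding_vector` layout, one reader can cover both. The following is a minimal sketch, not the project's own loader: the function name `load_embedding` is hypothetical, and it assumes whitespace-separated values and that every row has the same dimension.

```python
import io


def load_embedding(lines):
    """Parse `token_id v1 v2 ...` lines into a dict of id -> list of floats.

    Hypothetical helper for illustration; assumes whitespace-separated fields.
    """
    table = {}
    dim = None
    for line_no, line in enumerate(lines, start=1):
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        token_id = parts[0]
        vector = [float(x) for x in parts[1:]]
        if dim is None:
            dim = len(vector)  # the first row fixes the expected dimension
        elif len(vector) != dim:
            raise ValueError(
                f"line {line_no}: expected {dim} values, got {len(vector)}"
            )
        table[token_id] = vector
    return table


# In-memory stand-in for the first two rows of the example above.
sample = io.StringIO("W1 0 0 0 0\nW2 0.1 0.5 0.4 0.2\n")
word_embed = load_embedding(sample)
print(word_embed["W2"])  # [0.1, 0.5, 0.4, 0.2]
```

To read a real file, pass an open file handle instead of the `io.StringIO` object, for example `open("data/word_embed.txt", encoding="utf-8")`.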

## question.csv
This file should contain all the questions that appear in `train.csv` and `test.csv`.

Each line should be `question_id,word_sequence_ids,char_sequence_ids`. For example,
```text
qid,words,chars
Q1,W1 W2 W3,C31 C64 C45 C85
Q2,W2 W9 W7 W10 W20,C39 C58 C3
Q3,W23 W91 W7 W10 W290,C19 C81 C31
Q4,W25 W9 W70 W101 W210,C92 C58 C33
Q5,W22 W9 W7 W130 W20,C98 C85 C35
Q6,W2 W19 W87,C39 C86 C34
```
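
One possible way to read this file is sketched below with the standard `csv` module. The function name `load_questions` is hypothetical, and the sketch assumes the header row `qid,words,chars` is present as in the example, with ids inside each sequence separated by single spaces.

```python
import csv
import io


def load_questions(f):
    """Map each qid to its word-id and char-id sequences.

    Hypothetical helper; assumes a `qid,words,chars` header row.
    """
    questions = {}
    for row in csv.DictReader(f):
        questions[row["qid"]] = {
            "words": row["words"].split(),
            "chars": row["chars"].split(),
        }
    return questions


# In-memory stand-in for two rows of the example above.
sample = io.StringIO(
    "qid,words,chars\n"
    "Q1,W1 W2 W3,C31 C64 C45 C85\n"
    "Q6,W2 W19 W87,C39 C86 C34\n"
)
questions = load_questions(sample)
print(questions["Q1"]["words"])  # ['W1', 'W2', 'W3']
```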

## train.csv
This file should contain the training question pairs.

Each line should be `label,q1,q2`, where `q1` is the id of question 1 and `q2` is the id of question 2. `label=1` means the two questions have the same meaning; `label=0` means they have different meanings. For example:
```text
label,q1,q2
1,Q1,Q2
0,Q1,Q3
0,Q2,Q4
0,Q5,Q1
1,Q2,Q6
```
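
As a rough illustration of how the labeled pairs might be read, the sketch below parses the example rows into `(q1, q2, label)` tuples and counts the positive pairs. The name `load_train_pairs` is hypothetical, and rejecting labels other than 0 and 1 is an assumption drawn from the description above.

```python
import csv
import io


def load_train_pairs(f):
    """Return a list of (q1, q2, label) tuples from a `label,q1,q2` file.

    Hypothetical helper; assumes labels are restricted to 0 and 1.
    """
    pairs = []
    for row in csv.DictReader(f):
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label}")
        pairs.append((row["q1"], row["q2"], label))
    return pairs


# In-memory copy of the example above.
sample = io.StringIO(
    "label,q1,q2\n1,Q1,Q2\n0,Q1,Q3\n0,Q2,Q4\n0,Q5,Q1\n1,Q2,Q6\n"
)
train_pairs = load_train_pairs(sample)
positives = sum(label for _, _, label in train_pairs)
print(len(train_pairs), positives)  # 5 2
```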

## test.csv
This file should contain the testing question pairs.

Each line should be `q1,q2`, where `q1` is the id of question 1 and `q2` is the id of question 2. For example:
```text
q1,q2
Q2,Q3
Q6,Q5
```
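
Because every question id used in `train.csv` and `test.csv` is expected to appear in `question.csv`, a quick consistency check can catch malformed data early. This is a sketch under stated assumptions: the function name `missing_question_ids` is hypothetical, and the set of known ids is written out by hand here rather than read from `question.csv`.

```python
import csv
import io


def missing_question_ids(pairs_file, known_qids):
    """Return the sorted ids in the q1/q2 columns that are not in known_qids.

    Hypothetical helper; works for any file with `q1` and `q2` columns.
    """
    missing = set()
    for row in csv.DictReader(pairs_file):
        for column in ("q1", "q2"):
            if row[column] not in known_qids:
                missing.add(row[column])
    return sorted(missing)


# In practice these ids would come from the qid column of question.csv.
known = {"Q1", "Q2", "Q3", "Q4", "Q5", "Q6"}

ok = missing_question_ids(io.StringIO("q1,q2\nQ2,Q3\nQ6,Q5\n"), known)
bad = missing_question_ids(io.StringIO("q1,q2\nQ2,Q7\n"), known)
print(ok)   # []
print(bad)  # ['Q7']
```

The same check applies unchanged to `train.csv`, since `csv.DictReader` looks up the `q1` and `q2` columns by name and the extra `label` column is ignored.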
4 changes: 4 additions & 0 deletions data/char_embed.txt
@@ -0,0 +1,4 @@
C1 0 0 0 0
C2 0.1 0.5 0.4 0.2
C3 0.8 0.2 0.9 1.0
C4 0.14 0.15 0.64 0.12
7 changes: 7 additions & 0 deletions data/question.csv
@@ -0,0 +1,7 @@
qid,words,chars
Q1,W1 W2 W3,C31 C64 C45 C85
Q2,W2 W9 W7 W10 W20,C39 C58 C3
Q3,W23 W91 W7 W10 W290,C19 C81 C31
Q4,W25 W9 W70 W101 W210,C92 C58 C33
Q5,W22 W9 W7 W130 W20,C98 C85 C35
Q6,W2 W19 W87,C39 C86 C34
3 changes: 3 additions & 0 deletions data/test.csv
@@ -0,0 +1,3 @@
q1,q2
Q2,Q3
Q6,Q5
6 changes: 6 additions & 0 deletions data/train.csv
@@ -0,0 +1,6 @@
label,q1,q2
1,Q1,Q2
0,Q1,Q3
0,Q2,Q4
0,Q5,Q1
1,Q2,Q6
4 changes: 4 additions & 0 deletions data/word_embed.txt
@@ -0,0 +1,4 @@
W1 0 0 0 0
W2 0.1 0.5 0.4 0.2
W3 0.8 0.2 0.9 1.0
W4 0.14 0.15 0.64 0.12
