Yvette0828/ChineseSongComparison

DWave model

A model for deciding whether two songs are the same song.

FlowChart

Setup

# Install python dependencies
pip install -r requirements.txt

Prediction

python3 model_run.py

Description

The model has two parts:

  1. DSSM model: compares the contributors of the two songs
  2. PERT model: compares the titles of the two songs
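
The two parts listed above can be pictured as a two-stage check. The sketch below is only an illustration, not the code in `model_run.py`: the `Song` class, the `same_song` function, the order of the stages, and the stand-in scorers (`jaccard` in place of DSSM, `titles_equal` in place of PERT) are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Song:
    # Hypothetical record type; the real data lives in the .xlsx files.
    title: str
    contributors: List[str]


def jaccard(a: List[str], b: List[str]) -> float:
    """Stand-in for the DSSM contributor score: set overlap in [0, 1]."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def titles_equal(a: str, b: str) -> bool:
    """Stand-in for the PERT title comparison: normalized exact match."""
    return a.strip().lower() == b.strip().lower()


def same_song(
    x: Song,
    y: Song,
    threshold: float = 0.85,
    contributor_score: Callable[[List[str], List[str]], float] = jaccard,
    title_match: Callable[[str, str], bool] = titles_equal,
) -> bool:
    # Stage 1 (assumed): reject pairs whose contributors are too dissimilar.
    if contributor_score(x.contributors, y.contributors) < threshold:
        return False
    # Stage 2 (assumed): pairs that pass stage 1 must also agree on the title.
    return title_match(x.title, y.title)
```

With these stand-ins, two records with identical contributors and titles that differ only in surrounding whitespace are treated as the same song, while lowering `threshold` lets more contributor mismatches through to the title check.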

DSSM
├── test_bmat_contributors_match.py (to train DSSM)
│
├── data (the necessary lexicon and corpus)
│   │
│   ├── contributors_dict.json (the dictionary of contributors)
│   │
│   ├── QA_DSP2_2020S2_2 (dw)_checked.xlsx (the training data)
│   │
│   └── QA_DSP1_20221h_Suspense - DW.xlsx (the testing data)
│
└── dssm-model (model path)
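
For readers new to DSSM: the cited paper maps two texts to vectors and scores them by cosine similarity, using letter n-grams as input features. The sketch below shows only that scoring idea on raw character trigram counts, with no neural layers and no training. The function names are hypothetical and it is not the implementation in `test_bmat_contributors_match.py`.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams of a string padded with '#' boundary marks."""
    padded = f"#{text.lower()}#"
    return Counter(padded[i:i + n] for i in range(max(len(padded) - n + 1, 1)))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def contributor_similarity(a: str, b: str) -> float:
    # Simplified stand-in: a trained DSSM would pass the n-gram features
    # through learned layers before taking the cosine.
    return cosine(char_ngrams(a), char_ngrams(b))
```

A score near 1 means the two contributor strings share most of their trigrams; a score of 0 means they share none.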

PERT
├── dssm_process.py (to process the output of DSSM for the input of PERT)
├── PinyinCharDataProcesser.py (to provide the dataset)
├── py2wordPert.py (to do the Pinyin-to-character conversion task by PERT)
│
├── NEZHA (the NEZHA language model)
│
├── Configs (the configurations to train PERT at various scales)
│
├── Corpus (the necessary lexicon and the example corpus)
│   ├── CharListFrmC4P.txt (the list of Chinese characters)
│   ├── pinyinList.txt (the list of pinyin tokens)
│   ├── ModernChineseLexicon4PinyinMapping.txt (the word items and the corresponding pinyin tokens in Modern Chinese Lexicon)
│   ├── PERT_title_Chinese_test.txt (the corpus of Chinese characters)
│   └── PERT_title_pinyin_test.txt (the corpus of pinyin)
│
└── Models 
    ├── Bigram (the Bigram model trained on some news corpus)
    └── pert_tiny_py_lr5e4_10Bs_1e (the PERT model trained on some news corpus under the conditions of learning rate: 5e-4, batch size: 10, and epoch number: 1)
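
Pinyin-to-character conversion pairs each pinyin token with one Chinese character, so the character corpus and the pinyin corpus need to line up. The sketch below checks that alignment under assumed conventions: one title per line, the two files aligned line by line, and pinyin tokens separated by spaces. Neither the file format nor the `align_pairs` helper is confirmed by the repository.

```python
from typing import List, Tuple

Pair = Tuple[List[str], List[str]]


def align_pairs(
    char_lines: List[str], pinyin_lines: List[str]
) -> Tuple[List[Pair], List[str]]:
    """Split titles into usable (characters, pinyin tokens) pairs and rejects.

    A title is kept only when it has exactly one character per pinyin token
    (an assumed requirement for this sketch).
    """
    pairs: List[Pair] = []
    rejected: List[str] = []
    for chars, pinyin in zip(char_lines, pinyin_lines):
        chars = chars.strip()
        tokens = pinyin.split()
        if chars and len(chars) == len(tokens):
            pairs.append((list(chars), tokens))
        else:
            rejected.append(chars)
    return pairs, rejected


pairs, rejected = align_pairs(
    ["月亮代表我的心", "Love 故事"],
    ["yue liang dai biao wo de xin", "gu shi"],
)
```

Here the first title yields seven aligned character and pinyin pairs, while the second is rejected because its Latin letters have no pinyin tokens.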

Result

Result Folder
│
├── False_threshold_07.xlsx (false result of DSSM when threshold = 0.70)
├── False_threshold_085.xlsx (false result of DSSM when threshold = 0.85)
├── PERT_result_07.xlsx (PERT result when threshold = 0.70)
├── PERT_result_085.xlsx (PERT result when threshold = 0.85)
├── merge_result_07.xlsx (merged result of DSSM & PERT when threshold = 0.70)
├── merge_result_085.xlsx (merged result of DSSM & PERT when threshold = 0.85)
└── exceptionSongTitle.txt (song titles that PERT could not predict)
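
The two thresholds above trade one kind of error for another. The sketch below illustrates this on made-up scores, reading "false result" as pairs whose thresholded prediction disagrees with the label, which is an assumption. The pair IDs, scores, and the `misclassified` helper are hypothetical and are not taken from the result files.

```python
from typing import List, Tuple

# (pair id, similarity score, true label) with invented values for illustration.
ScoredPair = Tuple[str, float, bool]

SCORED: List[ScoredPair] = [
    ("p1", 0.92, True),
    ("p2", 0.78, True),
    ("p3", 0.74, False),
    ("p4", 0.40, False),
]


def misclassified(scored: List[ScoredPair], threshold: float) -> List[str]:
    """Return IDs of pairs whose prediction (score >= threshold) is wrong."""
    return [pid for pid, score, label in scored if (score >= threshold) != label]


errors_at_070 = misclassified(SCORED, 0.70)
errors_at_085 = misclassified(SCORED, 0.85)
```

On this toy data the looser threshold accepts a non-matching pair (`p3`), while the stricter one rejects a true match (`p2`).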

Reference

@inproceedings{huang2013learning,
  title={Learning deep structured semantic models for web search using clickthrough data},
  author={Huang, Po-Sen and He, Xiaodong and Gao, Jianfeng and Deng, Li and Acero, Alex and Heck, Larry},
  booktitle={Proceedings of the 22nd ACM international conference on Information \& Knowledge Management},
  pages={2333--2338},
  year={2013}
}
@article{DBLP:journals/corr/abs-2205-11737,
  author    = {Jinghui Xiao and
               Qun Liu and
               Xin Jiang and
               Yuanfeng Xiong and
               Haiteng Wu and
               Zhe Zhang},
  title     = {{PERT:} {A} New Solution to Pinyin to Character Conversion Task},
  journal   = {CoRR},
  volume    = {abs/2205.11737},
  year      = {2022},
  url       = {https://doi.org/10.48550/arXiv.2205.11737},
  doi       = {10.48550/arXiv.2205.11737},
  eprinttype = {arXiv},
  eprint    = {2205.11737},
  timestamp = {Mon, 30 May 2022 15:47:29 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2205-11737.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}
