This project aims to identify security patches (SPs) among the large volume of commits in an open-source software (OSS) repository.
The work is done in four steps, and its output lets a repo's maintainers handle SPs more easily.
- Step 1. Collect and pre-process the dataset. In this repo, the real-world dataset comes from Qemu's repo on GitHub and from a paper.
- Step 2. Train. Use the dataset to train and validate a CNN-LSTM model.
- Step 3. Test. Test the model from Step 2 and get the results. The accuracy, precision, recall, and F1 score are good.
- Step 4. Grade. The test results give us the SPs; we then apply a list of rules to divide the patches into four levels (A, B, C, and D). A means the patch is definitely one of the most important SPs, B means an important SP, C means a normal SP, and D means not an SP.
With this SP identification system, maintainers can focus on the SPs, or only on the important SPs, and fix the security bugs and their impact instead of checking every commit to find the SPs. This saves time and makes the work more efficient.
- Python 3.7
- Tensorflow = 1.15
- scikit-learn = 0.24.2
- nltk = 3.6.2
- GitPython = 3.1.18
- Pygments = 2.3.1
If you want to train the model, please use tensorflow-gpu; otherwise training will take a very long time.
Collect commit data from an OSS repo on GitHub. The files for this step are in /data.
Use git clone to clone the target repo, then run git log --pretty=format:"%h" > 1.txt to save the short commit IDs.
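Once the short IDs are saved, later steps need to read them back. The following is a minimal sketch, assuming the file holds one abbreviated hash per line, which is what git log --pretty=format:"%h" writes. The function name load_commit_ids is hypothetical and is not part of this repo.

```python
from pathlib import Path


def load_commit_ids(path):
    """Return the commit IDs stored one per line in `path`, skipping blank lines.

    Hypothetical helper for illustration; not taken from getdata.py.
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

For example, load_commit_ids("1.txt") would return the short IDs in the order git log printed them, newest commit first.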
- getdata.py: saves the most recent 1000 commits
- filter.py: filters the commits by keyword
- keywordslist.txt: the security keywords
- hash.txt: list of commit hashes, generated by git log
To pre-process the data:
- filter.py: filters commit messages by keyword and saves the result to /traindata
- getdata2.py: directly collects the most recent 1000 commits and saves them to a CSV file; it also does the filtering
- qemu/: copied from Qemu's repo
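The keyword filtering described above can be sketched as follows. This is an illustration only, not the code in filter.py. It assumes keywordslist.txt holds one keyword per line and that a commit is kept when any keyword appears in its message as a whole word, ignoring case. Both function names are hypothetical.

```python
import re


def load_keywords(path):
    """Read one keyword per line (assumed file layout), ignoring blank lines."""
    with open(path, encoding="utf-8") as f:
        return [line.strip().lower() for line in f if line.strip()]


def matches_keywords(message, keywords):
    """Return True if any keyword occurs in the message as a whole word or phrase.

    Matching is case-insensitive; keywords are expected in lower case.
    """
    text = message.lower()
    return any(
        re.search(r"\b" + re.escape(kw) + r"\b", text) is not None
        for kw in keywords
    )
```

Whole-word matching is a design choice in this sketch: it keeps a keyword such as "leak" from matching "leakage". Plain substring matching would be a simpler alternative if broader recall is preferred.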
To train the model, run train.py:
python train.py
You can also use parameters to specify the file to train on, e.g.
python train.py --data_file=data\traindata\commitmessage\train\qemu2.csv
More parameters are available; see the source file "train.py" for details.
Testing is as simple as training: run test.py
python test.py
If you want to use your own trained model or test your own data, pass parameters like this:
python test.py --test_data_file=data\traindata\commitmessage\evaluate\qemu2.csv --run_dir=runs\1625470566_cm
More parameters are available; see the source file "test.py" for details.
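The test step is judged by accuracy, precision, recall, and F1 score. As a reference for how these four numbers relate, here is a minimal sketch that computes them from two lists of labels. It assumes a binary encoding where 1 means SP and 0 means not SP. That encoding and the function name binary_metrics are assumptions for illustration, and test.py may compute its metrics differently.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels.

    Assumes 1 = security patch, 0 = not a security patch.
    Ratios with a zero denominator are reported as 0.0.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For SP identification, recall shows how many real SPs were found, while precision shows how many flagged commits are real SPs, so both matter to a maintainer reading the results.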
To test a CSV file whose first row contains the three column titles "label", "content", and "hash", run test_file.py
python test_file.py
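Before running the script, it can help to check that a CSV file has the expected header. The sketch below reads such a file with Python's standard csv module and fails early if one of the three columns is missing. The function name read_labeled_commits is hypothetical, and this is not the loading code used by test_file.py.

```python
import csv

REQUIRED_COLUMNS = {"label", "content", "hash"}


def read_labeled_commits(path):
    """Read (label, content, hash) tuples from a CSV file with a header row.

    Raises ValueError if any of the three required columns is missing.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError("missing columns: " + ", ".join(sorted(missing)))
        return [(row["label"], row["content"], row["hash"]) for row in reader]
```

Because csv.DictReader looks columns up by name, this check does not depend on the order of the three columns.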
All predictions are saved into /runs/[model id]/
Run patch_identify.py to get the scores
python patch_identify.py
You can also add new scoring rules or customize the levels by editing patch_identify.py.
To check the results, see the result/ folder.
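To show what a scoring rule and a level threshold might look like, here is a minimal sketch of rule-based grading into the four levels A, B, C, and D. Everything in it is hypothetical: the rule functions, the point values, the thresholds, and the patch record with its predicted_sp and message fields are invented for illustration and are not the rules in patch_identify.py.

```python
# Hypothetical rules: each takes a patch record (a dict) and returns points.
def rule_predicted_sp(patch):
    """2 points if the model predicted the commit to be a security patch."""
    return 2 if patch["predicted_sp"] else 0


def rule_mentions_cve(patch):
    """2 points if the commit message references a CVE identifier."""
    return 2 if "cve-" in patch["message"].lower() else 0


def rule_mentions_overflow(patch):
    """1 point if the commit message mentions an overflow."""
    return 1 if "overflow" in patch["message"].lower() else 0


RULES = [rule_predicted_sp, rule_mentions_cve, rule_mentions_overflow]

# Hypothetical thresholds, checked from the highest level to the lowest.
LEVELS = [(4, "A"), (3, "B"), (2, "C")]


def score_patch(patch, rules=RULES):
    """Sum the points awarded by every rule."""
    return sum(rule(patch) for rule in rules)


def level_for(score, levels=LEVELS):
    """Map a score to a level; anything below the lowest threshold is D."""
    for minimum, level in levels:
        if score >= minimum:
            return level
    return "D"
```

In a layout like this, adding a rule means appending a function to RULES, and customizing the levels means editing the thresholds in LEVELS.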