
A system demo designed to identify security patches in an OSS project's commits.


hanzhehao123/Identify-Security-Patch


Identify Security Patch in OSS

This project aims to identify security patches (SPs) among an OSS project's bulk of commits.

We implement this in four steps; with this system, a repository's maintainers can handle SPs more easily.

  • Step 1. Collect and pre-process the dataset. In this repo, the realistic dataset comes from the Qemu repository on GitHub and from a paper.
  • Step 2. Train. Use the dataset and a CNN-LSTM model for training and validation.
  • Step 3. Test. Evaluate the model from Step 2 and report the Accuracy, Precision, Recall, and F1 score.
  • Step 4. From the test results we obtain the SPs, then apply a list of rules to divide the patches into four levels (A, B, C, and D): A means the patch is one of the most critical SPs, B means an important SP, C a normal SP, and D not an SP.

With this SP identification system, maintainers can focus on the SPs, or the most important ones, to fix security bugs and their impacts, rather than checking every commit to find the SPs. This saves time and makes the work more efficient.

Requirements

  • Python 3.7
  • TensorFlow = 1.15
  • scikit-learn = 0.24.2
  • nltk = 3.6.2
  • GitPython = 3.1.18
  • Pygments = 2.3.1

If you want to train the model, please use tensorflow-gpu; otherwise training will take a very long time.

Prepare data

Collect commit data from an OSS repository on GitHub. See /data.

Use git clone to clone the target repo, then run git log --pretty=format:"%h" > 1.txt to save the commits' short IDs.

  • getdata.py: saves the most recent 1000 commits
  • filter.py: filters commits by security keywords
  • keywordslist.txt: list of security keywords
  • hash.txt: list of commit hashes, generated by git log
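As an illustration of the collection step, a minimal sketch of what getdata.py might do: read the short commit IDs produced by `git log --pretty=format:"%h"` and keep the most recent 1000. The file layout and the limit follow the description above, but the function name and details are assumptions, not the repo's actual implementation.

```python
from pathlib import Path
from typing import List


def recent_commits(hash_file: str, limit: int = 1000) -> List[str]:
    """Return up to `limit` most recent commit hashes.

    git log lists commits newest-first, so the head of the file
    already holds the most recent ones.
    """
    lines = Path(hash_file).read_text().splitlines()
    return [h.strip() for h in lines if h.strip()][:limit]
```
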

To pre-process the data:

  • filter.py: filters commit messages by keywords and saves them to /traindata
  • getdata2.py: directly collects the most recent 1000 commits, saves them into a CSV, and also does the filtering
  • qemu/: copied from the Qemu repo
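The keyword-filtering step can be sketched as follows, assuming keywordslist.txt holds one security keyword per line. The matching logic here (case-insensitive substring match) is an illustrative assumption; the repo's filter.py may match differently.

```python
from typing import List, Tuple


def load_keywords(path: str) -> List[str]:
    """Read one security keyword per line, lowercased, skipping blanks."""
    with open(path) as f:
        return [line.strip().lower() for line in f if line.strip()]


def filter_commits(commits: List[Tuple[str, str]],
                   keywords: List[str]) -> List[Tuple[str, str]]:
    """Keep (hash, message) pairs whose message mentions any keyword."""
    kept = []
    for commit_hash, message in commits:
        text = message.lower()
        if any(k in text for k in keywords):
            kept.append((commit_hash, message))
    return kept
```
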

Train

To train the model, run train.py:

python train.py

You can also pass parameters to specify the training file, e.g.

python train.py --data_file=data\traindata\commitmessage\train\qemu2.csv

More parameters are available; see the source file "train.py".

Test

Testing is as simple as training: run test.py

python test.py

If you want to use your own trained model or test your own data, pass parameters like this:

python test.py --test_data_file=data\traindata\commitmessage\evaluate\qemu2.csv --run_dir=runs\1625470566_cm

More parameters are available; see the source file "test.py".

To test a CSV file whose first row contains the three column titles "label", "content", and "hash", run test_file.py:

python test_file.py

All predictions are saved into /runs/[model id]/.
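A minimal sketch of producing a CSV in the format test_file.py expects: a header row with the three titles "label", "content", and "hash". The column order shown is an assumption based on the description above; check test_file.py for the order it actually reads.

```python
import csv
from typing import Iterable, Tuple


def write_commit_csv(path: str, rows: Iterable[Tuple[str, str, str]]) -> None:
    """Write (label, content, hash) rows under the expected header."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["label", "content", "hash"])
        writer.writerows(rows)
```
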

Get Score

Run patch_identify.py to get the scores:

python patch_identify.py

You can also add new scoring rules or customize the levels by editing patch_identify.py.
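The rule-based leveling from Step 4 could look roughly like this. The thresholds and the use of a single confidence score are illustrative assumptions, not the rules coded in patch_identify.py.

```python
def patch_level(is_sp: bool, score: float) -> str:
    """Map a model prediction to level A (critical SP) through D (not an SP).

    Hypothetical thresholds on the model's confidence score; real rules
    would live in patch_identify.py and may combine several signals.
    """
    if not is_sp:
        return "D"   # the model says this commit is not a security patch
    if score >= 0.9:
        return "A"   # one of the most critical security patches
    if score >= 0.7:
        return "B"   # important security patch
    return "C"       # normal security patch
```
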

To check the results, see the result/ folder.
