Identify Security Patch in OSS

This project aims to identify security patches (SPs) among the large volume of commits in an open-source software (OSS) repository.

The work is done in 4 steps, and the result makes it easier for a repo's maintainers to handle the SPs.

Step 1. Collect and pre-process the dataset. In this repo, the real-world dataset comes from Qemu's repo on GitHub and from a paper.
Step 2. Train. Use the dataset to train and validate a CNN-LSTM model.
Step 3. Test. Test the model from Step 2 and get the results. The Accuracy, Precision, Recall, and F1 score are good.
Step 4. Use the test results to obtain the SPs, then apply a list of rules to divide the patches into 4 levels (A, B, C, and D). A means the patch is definitely one of the most important SPs, B means an important SP, C means a normal SP, and D means not an SP.

With this SP identification system, maintainers can focus on the SPs, or only on the important SPs, to fix security bugs and their impact, rather than checking every commit to find the SPs. This saves time and makes the work more efficient.

Requirements

Python 3.7
Tensorflow = 1.15
scikit-learn = 0.24.2
nltk = 3.6.2
GitPython = 3.1.18
Pygments = 2.3.1

If you want to train the model, please use tensorflow-gpu; otherwise training will take a very long time.

Prepare data

Collect commit data from an OSS repo on GitHub. The related files are in /data.

Use git clone to clone the target repo, then use git log --pretty=format:"%h" > 1.txt to save the short commit IDs.

getdata.py: saves the most recent 1000 commits
filter.py: filters the commits by keyword
keywordslist.txt: security keywords
hash.txt: list of commit hashes, generated by git log

To pre-process the data:

filter.py: filters commit messages by keyword and saves them to /traindata
getdata2.py: directly collects the most recent 1000 commits and saves them to a csv file; it also does the filtering
qemu/: copied from Qemu's repo
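
The collection and filtering steps above can be sketched as follows. This is a minimal illustration, not the code in getdata.py or filter.py: the function names load_keywords, recent_commits, and filter_commits are hypothetical, and it assumes the keyword file holds one keyword per line and that a commit is kept when its message contains any keyword (case-insensitive).

```python
import subprocess


def load_keywords(path):
    """Read one keyword per line (assumed file layout), skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def recent_commits(repo_dir, limit=1000):
    """Return (short_hash, subject) pairs for the most recent commits."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "log", f"--max-count={limit}",
         "--pretty=format:%h%x09%s"],  # %x09 is a tab between hash and subject
        check=True, capture_output=True, text=True,
    ).stdout
    return [tuple(line.split("\t", 1)) for line in out.splitlines() if "\t" in line]


def filter_commits(commits, keywords):
    """Keep commits whose message contains at least one keyword."""
    lowered = [k.lower() for k in keywords]
    return [(h, msg) for h, msg in commits
            if any(k in msg.lower() for k in lowered)]
```

For example, filter_commits(recent_commits("qemu"), load_keywords("keywordslist.txt")) would return the candidate commits from a local clone, assuming both paths exist relative to the working directory.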

Train

To train the model, run train.py

python train.py

You can also use parameters to specify the training file, e.g.

python train.py --data_file=data\traindata\commitmessage\train\qemu2.csv

More parameters are available; see the source file "train.py" for details.

Test

Testing is as simple as training: run test.py

python test.py

If you want to use your own trained model or test your own data, pass parameters like this

python test.py --test_data_file=data\traindata\commitmessage\evaluate\qemu2.csv --run_dir=runs\1625470566_cm

More parameters are available; see the source file "test.py" for details.

To test a csv file whose first row contains the three column titles "label", "content", and "hash", run test_file.py

python test_file.py
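
As an illustration of the expected csv layout, the sketch below checks that a file has the three column titles and reads its rows. It is not taken from test_file.py: the function name read_labeled_csv is hypothetical, and the sample rows (including the use of 1 and 0 as label values) are invented for the example.

```python
import csv
import io


def read_labeled_csv(handle):
    """Read rows from a csv whose header has label, content, and hash columns."""
    reader = csv.DictReader(handle)
    missing = {"label", "content", "hash"} - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    return [(row["label"], row["content"], row["hash"]) for row in reader]


# Invented sample data; the real label encoding may differ.
sample = io.StringIO(
    "label,content,hash\n"
    "1,Fix out-of-bounds read in the USB emulation,abc1234\n"
    "0,Update documentation,def5678\n"
)
rows = read_labeled_csv(sample)
```

To check a file on disk, pass a handle opened with open(path, newline="", encoding="utf-8").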

All predictions are saved into /runs/[model id]/
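
The four metrics reported in the test step can be computed from true labels and predictions as sketched below. This is a plain-Python illustration under the assumption of binary labels where 1 marks a security patch; the function name binary_metrics is hypothetical and is not part of test.py.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Since scikit-learn is already listed in the requirements, its accuracy_score and precision_recall_fscore_support functions are an alternative to writing this by hand.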

Get Score

Run patch_identify.py to get the score

python patch_identify.py

You can also add new scoring rules or customize the levels by editing patch_identify.py
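
To give an idea of what a rule-based scoring scheme could look like, here is a minimal sketch. Everything in it is an assumption made for illustration: the rules, the point values, the thresholds, and the names RULES, score, and level are hypothetical and do not reflect the actual contents of patch_identify.py.

```python
# Each hypothetical rule: (description, predicate on the commit message, points).
RULES = [
    ("mentions a CVE", lambda msg: "cve-" in msg.lower(), 3),
    ("memory-safety keyword",
     lambda msg: any(k in msg.lower() for k in ("overflow", "use-after-free")), 2),
    ("mentions security", lambda msg: "security" in msg.lower(), 1),
]


def score(msg):
    """Sum the points of every rule that matches the commit message."""
    return sum(points for _, matches, points in RULES if matches(msg))


def level(is_sp, msg):
    """Map a model prediction plus rule score to a level A, B, C, or D."""
    if not is_sp:          # the model predicted "not a security patch"
        return "D"
    total = score(msg)
    if total >= 3:         # assumed threshold for the most important SPs
        return "A"
    if total >= 1:         # assumed threshold for important SPs
        return "B"
    return "C"             # predicted SP with no rule matched
```

With this layout, adding a rule means appending one tuple to RULES, and changing the levels means adjusting the thresholds in level.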

To check the result, see the result/ folder.

