This project is the author’s undergraduate thesis work. The core code is adapted from two open-source repositories: NSSL-SJTU/HermesSim and Cisco-Talos/binary_function_similarity. These were integrated and extended for the task of identifying and locating cryptographic functions in router firmware. This repository contains the corresponding automation scripts and tooling.
Additionally, two datasets were constructed:
- Dataset-Finetuning: for fine-tuning the similarity model.
- Dataset-Crypt: for the cryptographic function identification task. (Place the binaries to be analyzed in the `Binaries/Dataset-Crypt/vul/` folder, following the naming convention.)
Binaries/ # Raw binary files
DBs/ # Preprocessed graph/data outputs
IDA_script/ # IDA Python scripts for extracting ACFG graphs
IDBs/ # IDA analysis database files
bin/ # External tools and dependencies
lifting/ # Scripts for lifting binary functions into Pcode-based graphs
preprocess/ # Scripts for graph normalization and encoding
model/ # Neural network model and related experiment configurations
postprocess/ # Scripts for test-pair generation, fast evaluation, and visualization
inputs/ # Inputs for the model (iscg, tscg, sog)
outputs/ # Outputs for the model (checkpoint files, inferred embeddings, log)
Dockerfile # An OpenWrt 23.05-specific cross-compilation environment
Notice: The external tool gsat-1.0.jar must be downloaded and placed in bin/.
The author's intermediate and final experimental results are published in the Releases section.
- Python Environment (Python 3.10 required)
conda create -n cfd python=3.10
conda activate cfd
pip install -r requirements.txt \
--extra-index-url https://download.pytorch.org/whl/cu116 \
-f https://data.pyg.org/whl/torch-1.13.1+cu116.html
- IDA Pro Requirement
IDA Pro 9.1 for Linux is required.
Please update the paths in run_Finetuning.sh and run_Crypt.sh accordingly.
- Prepare firmware binaries
  Unpack IoT firmware images and place the target binaries under Binaries/Dataset-Crypt/vul/.
- (Optional - already provided) Fine-tune the HermesSim model using:
  ./run_Finetuning.sh
  ./run_Finetuning2.sh
- Update config
  In outputs/Finetuned/config.json, set:
  "checkpoint_name": "checkpoint_*.pt"
  to the desired checkpoint (e.g., the best-performing one).
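Editing config.json by hand works fine; if you prefer to script it, the sketch below (not part of the repository) picks a `checkpoint_*.pt` file from `outputs/Finetuned` and writes its name into `config.json`. The paths come from this README; using the most recently modified checkpoint as a stand-in for the "best" one is an assumption — substitute your own selection criterion.

```python
import glob
import json
import os

def set_latest_checkpoint(run_dir):
    """Point config.json at the newest checkpoint_*.pt in run_dir.

    Assumption: "newest by mtime" approximates the best-performing
    checkpoint; replace the max() key with your own criterion.
    """
    ckpts = glob.glob(os.path.join(run_dir, "checkpoint_*.pt"))
    if not ckpts:
        raise FileNotFoundError("no checkpoint_*.pt files in " + run_dir)
    latest = max(ckpts, key=os.path.getmtime)

    cfg_path = os.path.join(run_dir, "config.json")
    with open(cfg_path) as f:
        cfg = json.load(f)
    cfg["checkpoint_name"] = os.path.basename(latest)
    with open(cfg_path, "w") as f:
        json.dump(cfg, f, indent=2)
    return cfg["checkpoint_name"]
```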
- Run cryptographic function detection
  ./run_Crypt.sh
  ./run_Crypt2.sh
- Outputs
  - Fine-tuned checkpoints: outputs/Finetuned/graph-ggnn-batch_pair-pcode_sog
  - Detection results: outputs/Crypt
- (Optional) To extract the top-K most similar functions from the output similarity CSVs:
  python postprocess/3.pp_results/top_k.py <*_sim.csv> <num_of_results>
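If you want to post-process the similarity CSVs in your own scripts, a minimal stand-alone sketch is below. The column name `score` is an assumption — check the header of the actual `*_sim.csv` files; `top_k.py` in the repository remains the authoritative implementation.

```python
import csv

def top_k(sim_csv_path, k):
    """Return the k CSV rows with the highest similarity score.

    Assumed layout: a header row with a 'score' column holding the
    similarity value; adjust the column name to match your *_sim.csv.
    """
    with open(sim_csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    rows.sort(key=lambda r: float(r["score"]), reverse=True)
    return rows[:k]
```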
Q: I want to avoid matching very small functions. What can I do?
A: Edit the filtering rule in DBs/Dataset-Crypt/Dataset-Crypt_creation.py, line 8:
flowchart = flowchart[flowchart["bb_num"] >= 0]
Increase the threshold (e.g., to >= 5) to exclude short functions from matching.
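For reference, a minimal self-contained sketch of what that filter does, assuming `flowchart` is a pandas DataFrame with a `bb_num` (basic-block count) column; the function names here are made up for illustration:

```python
import pandas as pd

# Toy flowchart table: one row per function, with its basic-block count.
flowchart = pd.DataFrame({
    "func_name": ["tiny_stub", "aes_encrypt", "sha256_block"],
    "bb_num": [1, 12, 30],
})

# Raising the threshold from >= 0 to >= 5 drops very small functions
# (often stubs or thunks) before matching.
flowchart = flowchart[flowchart["bb_num"] >= 5]
print(flowchart["func_name"].tolist())  # ['aes_encrypt', 'sha256_block']
```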
Q: How can I create my own Dataset-Finetuning samples?
A: The source code includes a Dockerfile for building an OpenWrt 23.05-specific cross-compilation environment. You can use it to compile your own binaries for dataset generation:
docker build -t openwrt-crosscompile .
docker run -it --rm -v $(pwd)/src:/workspace openwrt-crosscompile
Inside the Docker container, place and build your source files under /workspace. The resulting binaries can be used to construct new samples for fine-tuning the model.