Commit 7dd3b16

First version of the full prototype (added datagen part).

1 parent d952ceb


44 files changed (+6225, −80 lines)

.gitignore

+4

@@ -0,0 +1,4 @@
+# Byte-compiled / optimized / DLL files
+__pycache__/
+*.py[cod]
+data

LIBEXP_CACHE.json.zip

8.9 MB
Binary file not shown.

README.md

+108-18
@@ -4,10 +4,58 @@ This is the official implementation of `Nero-GNN`, the prototype described in: [
 
 Our evaluation dataset and other resources are available [here](https://doi.org/10.5281/zenodo.4081641) (Zenodo). These will be used and further explained next.
 
-<center style="padding: 40px"><img width="90%" src="https://raw.githubusercontent.com/tech-srl/Nero/main/images/ACSG.png" /></center>
+![An overview of the data-gen process](https://github.com/tech-srl/Nero/blob/main/images/ACSG.png?raw=true "Data-generation Process")
 
-## Requirements
-* [python3.6](https://www.linuxbabe.com/ubuntu/install-python-3-6-ubuntu-16-04-16-10-17-04)
+This prototype is composed of two parts:
+1. [Data generation](#generating-representations-for-binary-procedures): used to generate the procedure representations for the model.
+1. [GNN neural model](#predicting-procedure-names-using-neural-models): uses the procedure representations to train a GNN model for predicting procedure names.
+
+## Requirements
+
+### Data Generation Specific Requirements
+
+* [python3.8](https://www.python.org/downloads/)
+* [LLVM version 10](https://llvm.org/docs/GettingStarted.html) and the llvmlite & llvmcpy python packages (other versions might work; 3.x will not).
+* [IDA-PRO](https://www.hex-rays.com/products/ida/) (tested with version 6.95).
+* [angr](http://angr.io) and the simuvex package.
+* A few more python packages: scandir, tqdm, jsonpickle, parmap, python-magic, pyelftools, setproctitle.
+
+Using a licensed IDA-PRO installation for Linux, all of these requirements were verified as compatible on an Ubuntu 20 machine (and, with some more effort, even on Ubuntu 16).
+
+For Ubuntu 20, you can use the `requirements.txt` file in this repository to install all python packages against the native python3.8 version:
+
+```bash
+pip3 install -r requirements.txt
+```
+
+LLVM version 10 can be installed with:
+```bash
+sudo apt-get install llvm-10
+```
+
+The IDA-python scripts (in `datagen/ida/py2`) were tested against the python 2.7 version bundled with IDA-PRO 6.95, and should work with newer versions at least up to 7.4 (more info [here](https://www.hex-rays.com/products/ida/support/ida74_idapython_python3.shtml)). Please [file a bug](https://github.com/tech-srl/Nero/issues) if they don't.
+
+The jsonpickle python package also needs to be installed for use by this bundled python version:
+
+1. Download the package:
+```bash
+wget https://files.pythonhosted.org/packages/32/d5/2f47f03d3f64c31b0d7070b488274631d7567c36e81a9f744e6638bb0f0d/jsonpickle-0.9.6.tar.gz
+```
+2. Extract only the package sources:
+```bash
+tar -xvf jsonpickle-0.9.6.tar.gz jsonpickle-0.9.6/jsonpickle/
+```
+3. Move it to the IDA-PRO python directory:
+```bash
+mv jsonpickle-0.9.6/jsonpickle /opt/ida-6.95/idal64/python/
+```
+
+Note that, when installed as root, IDA-PRO defaults to installing in `/opt/ida-6.95/idal64`. Other paths will require adjustments here and in other scripts.
+
+### Neural Model Specific Requirements
+
+* [python3.6](https://www.python.org/downloads/). (To use the same Ubuntu 20 machine for training and data generation, we recommend using [virtualenv](http://thomas-cokelaer.info/blog/2014/08/installing-another-python-version-into-virtualenv/).)
+* These two python packages: jsonpickle, scipy
 * TensorFlow 1.13.1 ([install](https://www.tensorflow.org/install/install_linux)) or using:
 
 ```bash
@@ -29,16 +77,56 @@ python3 -c 'import tensorflow as tf; print(tf.__version__)'
 
 ## Generating Representations for Binary Procedures
 
-[Our dataset](https://zenodo.org/record/4099685/files/nero_dataset_binaries.tar.gz) was created by compiling several GNU source-code packages into binary executables.
-The packages are split into three sets: training, validation and test (each in its own directory in the extracted archive).
+[Our binaries dataset](https://zenodo.org/record/4099685/files/nero_dataset_binaries.tar.gz) was created by compiling several GNU source-code packages into binary executables and performing a thorough cleanup and deduplication process (detailed in [our paper](https://arxiv.org/abs/1902.09122)).
+
+The packages are split into three sets: training, validation and test (each in its own directory in the extracted archive: `TRAIN`, `VALIDATE` & `TEST`, respectively).
+
+To obtain preprocessed representations for these binaries, you can either download our preprocessed dataset or create a new dataset from ours or any other binaries dataset.
+
+### Creating Representations
 
-| :construction: | We are working on sharing a stable and easy to use version of our binary representations generation system. <BR> Stay tuned for updates on this repository. | :construction: |
-|---------------|:------------------------:|---------------|
+#### Indexing
 
+Indexing, i.e., analyzing the binaries and creating augmented control-flow-graph-based representations for them, is performed using:
 
-Performing a thorough cleanup and deduplication process (detailed in
-[our paper](https://arxiv.org/abs/1902.09122)) resulted in a dataset containing
-67,246 samples. The procedure representations for these samples can be found
+```bash
+python3 -u index_binaries.py --input-dir TRAIN --output-dir TRAIN_INDEXED
+```
+
+where `TRAIN` is the directory holding the binaries to index, and results are placed in `TRAIN_INDEXED`.
+
+To index successfully, binaries must contain debug information and adhere to this file name structure:
+```
+<compiler>-<compiler version>__O<Optimization level(u for default)>__<Package name>[-<optional package version>]__<Executable name>
+```
+For example: "gcc-5__Ou__cssc__sccs".
+
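As a quick sanity check before indexing, the naming convention above can be validated mechanically. Below is a minimal, hypothetical sketch; it is not part of this repository, and the regex is only our reading of the documented convention:

```python
import re

# Our reading of the documented convention:
# <compiler>-<compiler version>__O<Optimization level(u for default)>__
# <Package name>[-<optional package version>]__<Executable name>
NAME_RE = re.compile(
    r"^(?P<compiler>[^-]+)-(?P<compiler_version>[^_]+)"
    r"__O(?P<opt_level>[^_]+)"
    r"__(?P<package>[^_]+?)(?:-(?P<package_version>[^_]+))?"
    r"__(?P<executable>.+)$"
)

def parse_binary_name(name):
    """Split a dataset file name into its fields; returns None if it doesn't conform."""
    match = NAME_RE.match(name)
    return match.groupdict() if match else None
```

For the example above, `parse_binary_name("gcc-5__Ou__cssc__sccs")` yields compiler `gcc`, compiler version `5`, default optimization level, package `cssc` and executable `sccs`.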
+Note that the indexing process might take several hours, and some of its results depend on the timeout value selected for procedure indexing (controlled by `--index-timeout`, with a default of 30 minutes). We recommend running it on a machine with multiple CPU cores and adequate RAM. Procedure indexing will also stop if more than 1000 unique CFG paths are extracted.
+
+To change the path to the IDA-PRO installation, use `--idal64-path`.
+
+#### Filter and collect
+
+Next, to filter and collect all the indexed procedures into one JSON file:
+```bash
+python3 -u collect_and_filter.py --input-dir TRAIN_INDEXED --output-file=train.json
+```
+
+This will filter and collect indexed procedures from `TRAIN_INDEXED` (which should hold the indexed binaries for training from the last step) and store them in `train.json`.
+
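For intuition, the collection step conceptually gathers one JSON record per indexed procedure into a single JSON-lines file. The stdlib-only sketch below only illustrates that layout; it is not the repository's `collect_and_filter.py`, and the directory layout and the `min_blocks` filter are assumptions made for illustration:

```python
import json
from pathlib import Path

def collect_procedures(input_dir, output_file, min_blocks=1):
    """Gather per-procedure JSON files under input_dir into one JSON-lines file,
    dropping procedures whose (assumed) "blocks" list has fewer than min_blocks entries."""
    kept = 0
    with open(output_file, "w") as out:
        for path in sorted(Path(input_dir).rglob("*.json")):
            proc = json.loads(path.read_text())
            if len(proc.get("blocks", [])) < min_blocks:
                continue  # filtered out
            out.write(json.dumps(proc) + "\n")  # one procedure per line
            kept += 1
    return kept
```

The one-record-per-line layout is what makes the later `awk 'NR==...'` trick for inspecting a single procedure work.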
+#### Preprocess for use by the model
+
+Finally, to preprocess the raw representations, preparing them for use by the neural model, use:
+
+```bash
+python3 preprocess.py -trd train.json -ted test.json -vd validation.json -o data
+```
+
+This will preprocess the training (`train.json`), validation (`validation.json`) and test (`test.json`) files. Note that this step requires TensorFlow and the other components mentioned [here](#neural-model-specific-requirements).
+
+### Using Prepared Representations
+
+The procedure representations for the binaries in our dataset can be found
 [in this archive](https://zenodo.org/record/4095276/files/procedure_representations.tar.gz).
 
 Extracting the procedure representations archive will create the folder `procedure_representations` and inside it two more folders:
@@ -50,15 +138,15 @@ The `preprocessed` directory contains:
 1. `data.val` - The (preprocessed) validation set samples.
 1. `data.test` - The (preprocessed) test set samples.
 
-## Training New Models
+## Predicting Procedure Names Using Neural Models
 
 As we show in [our paper](https://arxiv.org/abs/1902.09122), `Nero-GNN` is the best variation of our approach, and so we focus on and showcase it here.
 
-### Training from scratch
+### Training From Scratch
 
 Training a `Nero-GNN` model is performed by running the following command line:
 ```bash
-python3 -u nero.py --data procedure_representations/processed/data \
+python3 -u gnn.py --data procedure_representations/processed/data \
 --test procedure_representations/processed/data.val --save new_model/model \
 --gnn_layers NUM_GNN_LAYERS
 ```
@@ -67,7 +155,9 @@ Where `NUM_GNN_LAYERS` is the number of GNN layers. In the paper, we found `NUM_
 The paths to the (training) `--data` and (validation) `--test` arguments can be changed to point to a new dataset.
 Here, we provide the dataset that [we used in the paper](#generating-representations-for-binary-procedures).
 
-### Trained models
+We trained our models using a `Tesla V100` GPU. Other GPUs might require changing the number of GNN layers or other dimensions to fit into the available RAM.
+
+### Using Pre-Trained Models
 
 Trained models are available [in this archive](https://zenodo.org/record/4095276/files/nero_gnn_model.tar.gz).
 Extracting it will create the `gnn` directory composed of:
@@ -79,7 +169,7 @@ Extracting it will create the `gnn` directory composed of:
 
 Evaluation of a trained model is performed using the following command line:
 ```bash
-python3 -u nero.py --test procedure_representations/data.test \
+python3 -u gnn.py --test procedure_representations/data.test \
 --load gnn/model_iter495 \
 --gnn_layers NUM_GNN_LAYERS
 ```
@@ -92,17 +182,17 @@ The value of `NUM_GNN_LAYERS` should be the same as in training.
 * Use the `--no_api` flag during training **and** testing, to train an "obfuscated" model (as in Table 2 in [our paper](https://arxiv.org/abs/1902.09122)) - a model that does not use the API names (assuming they are obfuscated).
 
 
-### Understanding the prediction process and its results
+### Understanding the Prediction Process and Its Results
 
-This section provides a name prediction walkthrough for an example from our test set ([further explained here](#generating-representations-for-binary-procedures).
+This section provides a name prediction walk-through for an example from our test set ([further explained here](#generating-representations-for-binary-procedures)).
 For readability, we start straight from the graph representation (similar to the one depicted in Fig.2(c) in [our paper](https://arxiv.org/abs/1902.09122)) and skip the rest of the steps.
 
 The `get_tz` procedure from the `find` executable is part of the `findutils` package.
 This procedure is represented as a json found at line 1715 in `procedure_representations/raw/test.json`.
 
 This json can be pretty-printed by running:
 ```bash
-awk 'NR==1715' procedure_representations/raw/test.json | python -m json.tool
+awk 'NR==1715' procedure_representations/raw/test.json | python3 -m json.tool
 ```
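If `awk` is unavailable, the same record can be pretty-printed with a small stdlib-only Python helper. This is a convenience sketch we add here, not a repository script; only the file path and the line number 1715 come from the text above:

```python
import json
from itertools import islice

def pretty_line(path, line_number):
    """Return the JSON record stored on the given (1-based) line, pretty-printed."""
    with open(path) as f:
        line = next(islice(f, line_number - 1, line_number))
    return json.dumps(json.loads(line), indent=4, sort_keys=True)

# For the walk-through example:
# print(pretty_line("procedure_representations/raw/test.json", 1715))
```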
 
 This json represents the procedure's graph:
