This is the official implementation of `Nero-GNN`, the prototype described in [our paper](https://arxiv.org/abs/1902.09122).
Our evaluation dataset and other resources are available [here](https://doi.org/10.5281/zenodo.4081641) (Zenodo). These will be used and further explained next.
1. [Data generation](#generating-representations-for-binary-procedures): used to generate the procedure representations for the model.
1. [GNN neural model](#predicting-procedure-names-using-neural-models): uses the procedure representations to train a GNN model for predicting procedure names.
## Requirements
### Data Generation Specific Requirements
* [python3.8](https://www.python.org/downloads/)
* [LLVM version 10](https://llvm.org/docs/GettingStarted.html) and the llvmlite & llvmcpy python packages (other versions might work; 3.x will not).
* [IDA-PRO](https://www.hex-rays.com/products/ida/) (tested with version 6.95).
* [angr](http://angr.io) and the simuvex package.
* A few more python packages: scandir, tqdm, jsonpickle, parmap, python-magic, pyelftools, setproctitle.
Using a licensed IDA-PRO installation for Linux, all of these requirements were verified to work on an Ubuntu 20 machine (and, with some more effort, even on Ubuntu 16).
For Ubuntu 20, you can use the `requirements.txt` file in this repository to install all python packages against the native python3.8 version:
```bash
pip3 install -r requirements.txt
```
LLVM version 10 can be installed with:
```bash
sudo apt-get install llvm-10
```
The IDA-python scripts (in `datagen/ida/py2`) were tested against the python 2.7 version bundled with IDA-PRO 6.95, and should work with newer versions at least up to 7.4 (more info [here](https://www.hex-rays.com/products/ida/support/ida74_idapython_python3.shtml)). Please [file a bug](https://github.com/tech-srl/Nero/issues) if they don't.
The jsonpickle python package also needs to be installed for use by this bundled python version:
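One possible way to do this (a sketch; it assumes pip is available for the Python 2.7 interpreter that IDA uses, which may not match your setup):

```bash
# install jsonpickle for the Python 2.7 interpreter used by IDA-PRO
# (adjust the interpreter path if your IDA installation bundles its own Python)
python2.7 -m pip install jsonpickle
```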
Note that, when installed as root, IDA-PRO defaults to installing in `/opt/ida-6.95/idal64`. Other paths will require adjusting here and in other scripts.
### Neural Model Specific Requirements
* [python3.6](https://www.python.org/downloads/) (for using the same Ubuntu 20 machine for training and data generation, we recommend [virtualenv](http://thomas-cokelaer.info/blog/2014/08/installing-another-python-version-into-virtualenv/)).
* These two python packages: jsonpickle and scipy.
* TensorFlow 1.13.1 ([install](https://www.tensorflow.org/install/install_linux)) or using:
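For example, a sketch of the pip route (the exact command may differ, and a GPU build such as `tensorflow-gpu==1.13.1` may be preferable for training on a GPU):

```bash
# install the CPU build of TensorFlow 1.13.1 for python3.6
pip3 install tensorflow==1.13.1
```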
## Generating Representations for Binary Procedures
[Our binaries dataset](https://zenodo.org/record/4099685/files/nero_dataset_binaries.tar.gz) was created by compiling several GNU source-code packages into binary executables and performing a thorough cleanup and deduplication process (detailed in [our paper](https://arxiv.org/abs/1902.09122)).
The packages are split into three sets: training, validation and test (each in its own directory in the extracted archive: `TRAIN`, `VALIDATE` and `TEST`, respectively).
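A possible way to fetch and unpack the binaries dataset (a sketch; depending on the archive layout, the three directories may sit under a top-level folder):

```bash
# download and extract the binaries dataset from Zenodo
wget https://zenodo.org/record/4099685/files/nero_dataset_binaries.tar.gz
tar -xzf nero_dataset_binaries.tar.gz
ls  # should now include the TRAIN, VALIDATE and TEST directories
```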
To obtain preprocessed representations for these binaries, you can either download our preprocessed dataset or create a new one from our binaries dataset (or any other binaries dataset).
### Creating Representations
In the indexing command, `TRAIN` is the directory holding the binaries to index, and the results are placed in `TRAIN_INDEXED`.
To index successfully, binaries must contain debug information and adhere to this file name structure:
```
<compiler>-<compiler version>__O<Optimization level(u for default)>__<Package name>[-<optional package version>]__<Executable name>
```
For example: `gcc-5__Ou__cssc__sccs`.
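Binaries built outside this convention can presumably just be renamed to match it before indexing. A hypothetical sketch (the package and executable names below are made up for illustration):

```bash
# rename a binary built with gcc 7 at the default optimization level,
# from a hypothetical package "mypkg" version 1.2, executable "mytool"
mv mytool gcc-7__Ou__mypkg-1.2__mytool
```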
Note that the indexing process might take several hours, and some of its results depend on the timeout value selected for procedure indexing (controlled by `--index-timeout`, with a default of 30 minutes). We recommend running it on a machine with multiple CPU cores and adequate RAM. Procedure indexing will also stop if more than 1000 unique CFG paths are extracted.
To change the path to the IDA-PRO installation use `--idal64-path`.
#### Filter and collect
Next, filter and collect all the indexed procedures into one JSON file. This will collect the indexed procedures from `TRAIN_INDEXED` (which should hold the indexed binaries for training from the last step) and store them in `train.json`.
#### Preprocess for use by the model
Finally, to preprocess raw representations, preparing them for use by the neural model, use:
```bash
python3 preprocess.py -trd train.json -ted test.json -vd validation.json -o data
```
This will preprocess the training (`train.json`), validation (`validation.json`) and test (`test.json`) files. Note that this step requires TensorFlow and the other components mentioned [here](#neural-model-specific-requirements).
### Using Prepared Representations
The procedure representations for the binaries in our dataset can be found
[in this archive](https://zenodo.org/record/4095276/files/procedure_representations.tar.gz).
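A possible way to fetch and unpack it (a sketch; the expected contents are described below):

```bash
# download and extract the prepared procedure representations from Zenodo
wget https://zenodo.org/record/4095276/files/procedure_representations.tar.gz
tar -xzf procedure_representations.tar.gz
ls procedure_representations  # expected: preprocessed/ raw/
```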
Extracting the procedure representations archive will create the folder `procedure_representations` and inside it two more folders:
1. `raw` - the raw procedure representations (e.g., `test.json`, referenced below).
1. `preprocessed` - the representations after preprocessing, ready for use by the neural model.

The `preprocessed` directory contains:
1. `data.val` - The (preprocessed) validation set samples.
1. `data.test` - The (preprocessed) test set samples.
## Predicting Procedure Names Using Neural Models
As we show in [our paper](https://arxiv.org/abs/1902.09122), `Nero-GNN` is the best-performing variation of our approach, so we focus on and showcase it here.
### Training From Scratch
Training a `Nero-GNN` model is performed by running the following command line:
Where `NUM_GNN_LAYERS` is the number of GNN layers (the value we found to work best is reported in [our paper](https://arxiv.org/abs/1902.09122)).
The paths to the (training) `--data` and (validation) `--test` arguments can be changed to point to a new dataset.
Here, we provide the dataset that [we used in the paper](#generating-representations-for-binary-procedures).
We trained our models using a `Tesla V100` GPU. Other GPUs might require changing the number of GNN layers or other dimensions to fit the model into the available GPU memory.
### Using Pre-Trained Models
Trained models are available [in this archive](https://zenodo.org/record/4095276/files/nero_gnn_model.tar.gz).
Extracting it will create the `gnn` directory composed of:
Evaluation of a trained model is performed using the following command line:
The value of `NUM_GNN_LAYERS` should be the same as in training.
* Use the `--no_api` flag during training **and** testing, to train an "obfuscated" model (as in Table 2 in [our paper](https://arxiv.org/abs/1902.09122)) - a model that does not use the API names (assuming they are obfuscated).
### Understanding the Prediction Process and Its Results
This section provides a name prediction walk-through for an example from our test set ([further explained here](#generating-representations-for-binary-procedures)).
For readability, we start straight from the graph representation (similar to the one depicted in Fig.2(c) in [our paper](https://arxiv.org/abs/1902.09122)) and skip the rest of the steps.
100
190
The `get_tz` procedure from the `find` executable is part of `findutils` package.
101
191
This procedure is represented as a JSON record found at line 1715 in `procedure_representations/raw/test.json`.
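To peek at this record from the command line, one option is the sketch below (it assumes each procedure occupies a single line of the file, as the line-number reference suggests):

```bash
# print line 1715 of the raw test set and pretty-print it as JSON
sed -n '1715p' procedure_representations/raw/test.json | python3 -m json.tool
```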