CppCodeAnalyzer is a Python-based parsing tool that constructs code property graphs for C/C++. It is the Python version of [CodeParser](https://github.com/for-just-we/CodeParser). Most of its functionality is similar to Joern; the differences are:

- The grammar comes from the official ANTLR repository [grammars-v4](https://github.com/antlr/grammars-v4), so the input of the `ast` module (the ANTLR AST) is quite different from Joern's, and the parsing code in the `ast` package differs accordingly. The customized AST it outputs is the same as Joern's.
- When constructing the CFG, CppCodeAnalyzer takes `for-range` and `try-catch` into consideration.
  * When parsing code such as `for (auto p: vec){ xxx }`, the CFG looks like Graph 1.
  * When parsing `try-catch`, we simply ignore the statements in the catch block, because they are not executed under normal conditions and the control flow in `try-catch` is quite hard to compute.
- When extracting use-def information in the `udg` package, we also take pointer uses into account. For example, `memcpy(dest, src, 100);` defines the symbol `* dest` and uses the symbol `* src`. Joern handles pointer definitions through its `Tainted` variable but does not consider pointer uses.

Graph 1
```mermaid
graph LR
EmptyCondition --> A[auto p: vec]
A --> B[xxx]
B --> EmptyCondition
EmptyCondition --> Exit
```
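Graph 1 can also be written down as a plain successor map. The sketch below is only an illustration of the shape of that CFG: `build_for_range_cfg` and the node labels are hypothetical names chosen here, not part of CppCodeAnalyzer's API.

```python
from collections import defaultdict
from typing import Dict, List


def build_for_range_cfg(init: str, body: List[str]) -> Dict[str, List[str]]:
    """Return successor lists for the CFG of `for (<init>) { <body> }`.

    Hypothetical helper mirroring Graph 1: a range-based loop has no explicit
    condition, so an `EmptyCondition` node chooses between running another
    iteration and leaving the loop.
    """
    cfg: Dict[str, List[str]] = defaultdict(list)
    cfg["EmptyCondition"].append(init)       # start another iteration
    previous = init
    for statement in body:                   # body statements run in order
        cfg[previous].append(statement)
        previous = statement
    cfg[previous].append("EmptyCondition")   # back edge to the condition
    cfg["EmptyCondition"].append("Exit")     # leave the loop
    return dict(cfg)


cfg = build_for_range_cfg("auto p: vec", ["xxx"])
for source, targets in cfg.items():
    for target in targets:
        print(f"{source} --> {target}")
```

The printed edges are the four edges of Graph 1, with the declaration `auto p: vec` treated as the first node of every iteration.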
The pipeline of CppCodeAnalyzer is similar to Joern's and can be illustrated as:
```mermaid
graph LR
AntlrAST -- transform --> AST -- control flow analysis --> CFG
CFG -- dominance analysis --> CDG
CFG -- symbol def use analysis --> UDG
UDG -- data dependence analysis --> DDG
```
For more details, see [Analysis of the Joern tool workflow](https://blog.csdn.net/qq_44370676/article/details/125089161) (in Chinese).
- The `ast` package transforms the ANTLR AST into the customized AST.
- The `cfg` package conducts control flow analysis and converts the customized AST into a CFG.
- The `cdg` package conducts dominance analysis on statements and constructs control dependence relations between them.
- The `udg` package analyzes the symbols defined and used in each statement independently.
- The `ddg` package constructs data dependence relations between statements from the def-use information computed by the `udg` package.
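To make the `cdg` step more concrete, the following sketch derives control dependences from post-dominators on the `for-range` CFG of Graph 1. It uses the textbook definition (a node `y` is control dependent on `x` if `y` post-dominates a successor of `x` but does not strictly post-dominate `x`). The function names are hypothetical, and this is not necessarily how the `cdg` package is implemented.

```python
from typing import Dict, List, Set, Tuple

CFG = Dict[str, List[str]]


def post_dominators(cfg: CFG, exit_node: str) -> Dict[str, Set[str]]:
    """For every node, compute the set of nodes that post-dominate it."""
    nodes = set(cfg) | {t for targets in cfg.values() for t in targets}
    pdom = {n: set(nodes) for n in nodes}    # start from "everything"
    pdom[exit_node] = {exit_node}
    changed = True
    while changed:                            # iterate to a fixed point
        changed = False
        for n in nodes - {exit_node}:
            succs = cfg.get(n, [])
            common = set.intersection(*(pdom[s] for s in succs)) if succs else set()
            new = {n} | common
            if new != pdom[n]:
                pdom[n] = new
                changed = True
    return pdom


def control_dependences(cfg: CFG, exit_node: str) -> Set[Tuple[str, str]]:
    """Return pairs (x, y) meaning: y is control dependent on x."""
    pdom = post_dominators(cfg, exit_node)
    deps: Set[Tuple[str, str]] = set()
    for x, succs in cfg.items():
        for z in succs:
            for y in pdom[z]:                 # y post-dominates z (or y == z)
                if y == x or y not in pdom[x]:  # y does not strictly post-dominate x
                    deps.add((x, y))
    return deps


# Successor map of Graph 1.
graph_1 = {
    "EmptyCondition": ["auto p: vec", "Exit"],
    "auto p: vec": ["xxx"],
    "xxx": ["EmptyCondition"],
}
deps = control_dependences(graph_1, "Exit")
for controller, dependent in sorted(deps):
    print(f"{dependent} is control dependent on {controller}")
```

For Graph 1 this reports that `auto p: vec`, `xxx`, and the loop condition itself are control dependent on `EmptyCondition`, which is the usual result for a loop.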
# Usage
The test files in the directory `test/mainToolTests` illustrate how each module works; refer to those test cases to learn how to use the CppCodeAnalyzer API.
# Our motivations
- When we ran experiments parsing the SARD datasets with Joern, we found some errors. The statement `wchar_t data[50] = L'A';` should be a single CFG node, but each token in the statement was assigned its own CFG node. After checking the source code, we believe the root cause is the grammar used by [Joern](https://github.com/octopus-platform/joern/blob/dev/projects/extensions/joern-fuzzyc/src/main/java/antlr/Function.g4#L13).
- Also, most researchers write their deep-learning programs in Python, so parsing code in Python is more convenient: the parsing module can connect directly to the deep-learning module, and there is no need to write scripts to parse Joern's output.
# Challenges
- Computing control flow for `for-range` and `try-catch` is difficult, and we found no material describing the CFG of either construct.
- Extracting def-use information for pointer variables is difficult. For example, in `*(p+i+1) = a[i][j];` the defined symbols include `* p` and the used symbols include `p`, `i`, `j`, `a` and `* a`. This is not very accurate, but computing memory locations statically is difficult. It can lead to the following problem.
```cpp
s1: memset(source, 'A', 100);
s2: source[99] = '\0';
s3: memcpy(data, source, 100);
```
In CppCodeAnalyzer's results, both s1 and s2 define the symbol `* source`, and the later definition kills the earlier one, so the DDG contains only the edge `s2 -> s3`. Strictly speaking, however, s1 defines `* source` while s2 defines `* (source + 99)`, so a precise DDG should contain both edges `s1 -> s3` and `s2 -> s3`.
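The effect of killing pointer definitions in the `s1` to `s3` example above can be reproduced with a small reaching-definitions pass over straight-line code. This is a sketch under assumptions: the `Statement` type, the `kill_pointer_defs` switch, and the use sets given for `s1` to `s3` are invented for illustration and are not taken from the `udg` or `ddg` packages.

```python
from typing import Dict, List, NamedTuple, Set, Tuple


class Statement(NamedTuple):
    label: str
    defs: Set[str]   # symbols defined by the statement
    uses: Set[str]   # symbols used by the statement


def data_dependences(stmts: List[Statement], kill_pointer_defs: bool) -> Set[Tuple[str, str]]:
    """Reaching definitions over straight-line code.

    With kill_pointer_defs=True, a new definition of `* p` replaces every
    earlier one. With False, definitions of dereferenced symbols accumulate,
    which over-approximates instead of dropping edges.
    """
    reaching: Dict[str, Set[str]] = {}        # symbol -> labels of live definitions
    edges: Set[Tuple[str, str]] = set()
    for stmt in stmts:
        for symbol in stmt.uses:              # connect each use to its live definitions
            for definer in reaching.get(symbol, set()):
                edges.add((definer, stmt.label))
        for symbol in stmt.defs:
            is_pointer = symbol.startswith("*")
            if is_pointer and not kill_pointer_defs:
                reaching.setdefault(symbol, set()).add(stmt.label)
            else:
                reaching[symbol] = {stmt.label}   # the new definition kills older ones
    return edges


# Assumed def/use sets for the three statements of the example.
program = [
    Statement("s1", {"* source"}, {"source"}),
    Statement("s2", {"* source"}, {"source"}),
    Statement("s3", {"* data"}, {"data", "source", "* source"}),
]
killing = data_dependences(program, kill_pointer_defs=True)
accumulating = data_dependences(program, kill_pointer_defs=False)
print(sorted(killing))
print(sorted(accumulating))
```

With killing enabled only `s2 -> s3` remains. Without it both `s1 -> s3` and `s2 -> s3` are reported, at the cost of spurious edges whenever a later write really does overwrite the same memory.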
Also, our tool is much slower than Joern: parsing a file from the SARD dataset normally takes 20 to 30 seconds. If you need to train a model, we recommend dumping the output CPG to JSON first.
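A minimal sketch of that caching step, using only the standard `json` module, is shown below. The dictionary layout (`nodes`, `edges`, `type`, and so on), the file name, and the helper names are hypothetical; CppCodeAnalyzer's real output schema may look different.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical in-memory CPG for one function; the field names are
# illustrative and do not reflect CppCodeAnalyzer's real output schema.
cpg = {
    "file": "example.c",
    "nodes": [
        {"id": 0, "code": "memset(source, 'A', 100);"},
        {"id": 1, "code": "source[99] = '\\0';"},
        {"id": 2, "code": "memcpy(data, source, 100);"},
    ],
    "edges": [
        {"src": 0, "dst": 1, "type": "CFG"},
        {"src": 1, "dst": 2, "type": "CFG"},
        {"src": 1, "dst": 2, "type": "DDG", "symbol": "* source"},
    ],
}


def dump_cpg(graph: dict, path: Path) -> None:
    """Store a parsed graph as JSON so the slow parse runs only once."""
    path.write_text(json.dumps(graph, indent=2), encoding="utf-8")


def load_cpg(path: Path) -> dict:
    """Reload a cached graph without re-running the parser."""
    return json.loads(path.read_text(encoding="utf-8"))


cache_file = Path(tempfile.gettempdir()) / "example.c.cpg.json"
dump_cpg(cpg, cache_file)
restored = load_cpg(cache_file)
print(len(restored["nodes"]), "nodes,", len(restored["edges"]), "edges")
```

A training script can then read the cached JSON files directly instead of paying the parsing cost on every run.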
# Extra Tools
The package `extraTools` contains preprocessing code for the vulnerability detectors IVDetect, SySeVR and DeepWuKong. For usage, refer to the files in `test/extraToolTests`.
# References
> [Yamaguchi, F., Golde, N., Arp, D., & Rieck, K. (2014). Modeling and Discovering Vulnerabilities with Code Property Graphs. IEEE Symposium on Security and Privacy. IEEE.](https://ieeexplore.ieee.org/document/6956589)

> [Li Y, Wang S, Nguyen T N. Vulnerability Detection with Fine-grained Interpretations. 2021.](https://arxiv.org/abs/2106.10478)

> [SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities\[J\]. IEEE Transactions on Dependable and Secure Computing, 2021, PP(99):1-1.](https://arxiv.org/abs/1807.06756)

> [Cheng X, Wang H, Hua J, et al. DeepWukong\[J\]. ACM Transactions on Software Engineering and Methodology (TOSEM), 2021.](https://dl.acm.org/doi/10.1145/3436877)