Once malicious software has been installed on a system, it needs to be told what to do. To receive instructions, the malware communicates with a command and control server. Historically, the domain name of the command and control server was hardcoded into the malware, which made it easy for cybersecurity professionals to block the communication channel by blacklisting the domain once it was discovered.
To avoid this weakness, malware programs have started to use Domain Generation Algorithms (DGAs). These algorithms generate a large number of pseudo-random domain names, one of which belongs to the command and control server. The infected computer scans through these domains, trying to contact each one until it eventually reaches the correct domain. At that point, the malware can receive instructions remotely.
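As a rough illustration (not the algorithm of any particular malware family), a DGA can be as simple as repeatedly hashing a seed value:

```python
import hashlib

def generate_domains(seed, count=10, tld=".com"):
    """Deterministically derive pseudo-random domain names from a seed.

    Both the malware and its operator run the same algorithm, so the operator
    only needs to register one of the candidate domains in advance.
    """
    domains = []
    current = seed
    for _ in range(count):
        digest = hashlib.md5(current.encode()).hexdigest()
        domains.append(digest[:16] + tld)
        current = digest  # feed the hash back in to get the next candidate
    return domains

print(generate_domains("2024-06-01"))
```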
First, the given data was explored, as documented in the EDA Notebook, with a view to understanding:
- The nature of the data.
- The pre-processing steps potentially required to sanitize the data.
- To what extent the classes `legit` and `dga` were present, and which scoring measures would be appropriate (a quick check along these lines is sketched below).
- Which candidate features could be generated.
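For example, a first pass at the class balance and a candidate feature might look something like this (the column names `class` and `domain` are assumptions for illustration, not necessarily those used in the source CSV):

```python
import pandas as pd

# Path from this repo; column names ("class", "domain") are assumed for illustration.
df = pd.read_csv("data/raw/dga_domains.csv")

print(df.head())
print(df["class"].value_counts(normalize=True))  # how balanced are legit vs dga?
print(df["domain"].str.len().describe())         # first look at a candidate feature
```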
Once the data was better understood and potential features had been identified, pipelines implementing the desired transformations were developed. These pipelines were then used to generate different features, and the ability of each feature to discriminate between the classes was explored. Feature selection was then performed using Principal Component Analysis. The process is documented in the Feature Generation and Selection Notebook.
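As an illustrative sketch of such a pipeline (the features and parameters here are placeholders, not the exact ones developed in the notebook):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class SimpleDomainFeatures(BaseEstimator, TransformerMixin):
    """Illustrative transformer: length, digit ratio and vowel ratio per domain."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        rows = []
        for domain in X:
            length = len(domain)
            digit_ratio = sum(c.isdigit() for c in domain) / max(length, 1)
            vowel_ratio = sum(c in "aeiou" for c in domain) / max(length, 1)
            rows.append([length, digit_ratio, vowel_ratio])
        return np.array(rows)


feature_pipeline = Pipeline([
    ("features", SimpleDomainFeatures()),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
])

print(feature_pipeline.fit_transform(["google", "xkqzjvhw3f9p"]).shape)
```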
Finally, once suitable features had been identified, model selection was performed. This was achieved by fitting a number of naive models from a variety of model families. Of the families fitted, Random Forest was identified as the most suitable. This information was then carried forward to training by wrapping a Grid Search Cross Validation pipeline around the Random Forest Classifier, as documented in the Model Selection Notebook.
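A minimal sketch of that training pattern, with a placeholder parameter grid rather than the grid actually searched in the notebook:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder search space; the real grid lives in the Model Selection Notebook.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
    n_jobs=-1,
)
# search.fit(X_train, y_train) would yield search.best_params_ and
# search.best_estimator_, which is what ends up persisted as the trained model.
```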
All you need is a Google account to run this repo in a lightweight, temporary cloud session.
It is assumed you have Python 3.6 or above installed.
Setup scripts are included for Windows and Linux based environments. The setup scripts do the following:
- Install virtualenv
- Create a virtual environment `.venv`
- Activate the virtual environment `.venv`
- Install the Python packages in `requirements.txt`
- Register an IPython kernel as used in the Notebooks
- Run the `unittests` with pytest
- Run the `integrationtests` with pytest
- Train the model
- Run a simple test of the model
- Enter an interactive mode where you can query the model
setup\windows.bat
You may need to set executable permission on the bash script.
sudo chmod +x setup/linux.sh
setup/linux.sh
If you've run the setup scripts above, you will have a trained model ready for use. However, if you choose to set up your environment manually, here is an overview of the steps that need to be taken.
The following commands assume you're in the root folder of this git repo.
Windows
python train_model.py -p data\raw\dga_domains.csv -o models
Linux and MacOS
python train_model.py -p data/raw/dga_domains.csv -o models
The model training script expects at least two parameters to be passed in:
- `-p` for the path to the source data.
- `-o` for where the trained model will be written out.
Windows, Linux, and MacOS
python test_model.py
Runs a trivial test on the model to ensure it has been built.
Windows, Linux, and MacOS
For an interactive session, where you can type in domains
python dga_classify.py -i
To get predictions for a single domain or a comma-separated list of domains
python dga_classify.py reddit,facebook.com,google.co.uk
The script returns one of the following exit codes:
- `0` - No DGA domains were predicted from any of the inputs.
- `2` - No predictions were made, e.g. empty or invalid inputs.
- `3` - DGA domains were predicted.
Using the return codes, it is possible to call this script as part of a shell script or other automated process.
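The same idea works from any caller that can inspect exit codes; for example, a small illustrative Python wrapper:

```python
import subprocess
import sys

# Run the classifier on a batch of domains and branch on its exit code.
result = subprocess.run(
    [sys.executable, "dga_classify.py", "reddit,facebook.com,google.co.uk"]
)

if result.returncode == 3:
    print("At least one DGA domain predicted - raise an alert")
elif result.returncode == 2:
    print("No predictions were made - check the inputs")
else:
    print("All domains look legitimate")
```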
+---data
| +---interim
| +---processed
| \---raw
The source data is in the `raw` sub-directory.
+---integrationtests
| +---models
| +---test_00_preprocessing
| +---test_01_feature_generation
| +---test_02_rescale
| +---test_04_prepare_model_inputs
| +---test_05_train_model
| \---test_06_query_model
...
\---unittests
+---data
+---features
| \---transformer
+---pipeline
| \---steps
\---preprocessing
+---column
\---text
+---src
| +---data
| +---features
| | \---transformer
| +---logging
| +---model
| +---pipeline
| | \---steps
| \---preprocessing
| \---transformer
Each of the sub-directories is exposed as a Python Package.
For loading data.
Concerned with feature generation, and the sklearn compatible transformer pipelines that implement these feature generators.
Helper package for using built-in logging features.
Training, testing, loading, and "querying" the model we've derived from our source data.
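As a rough sketch of what "querying" amounts to (this assumes `trained.model` is a joblib-persisted estimator whose pipeline accepts raw domain strings; the actual loading helpers live in this package):

```python
import joblib

# Assumption for illustration: the persisted model is loadable with joblib
# and its pipeline accepts raw domain strings.
model = joblib.load("models/trained.model")
print(model.predict(["facebook.com", "xkqzjvhw3f9p.com"]))
```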
Separated into complete pipelines and steps for use in pipelines.
Transformations for getting input data into a usable state.
+---models
+---notebooks
+---scripts
+---setup
This is where built models can be stored. This folder is excluded from git.
Once a model has been trained the following files are written to this folder:
- `report_test.txt` - report of test predictions detailing precision, recall and f1-score.
- `report_train.txt` - report of training predictions detailing precision, recall and f1-score.
- `trained.model` - the persisted model.
- `trained_params.json` - the optimal parameters found via Cross Validated Grid Search.
- `training.log` - log file of the training process.
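For example, the chosen hyper-parameters can be inspected after training (assuming `trained_params.json` is a flat JSON mapping of parameter names to values):

```python
import json

# Quick look at the hyper-parameters the grid search settled on.
with open("models/trained_params.json") as fh:
    print(json.load(fh))
```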
Documenting the phases of:
- Exploratory Data Analysis
- Feature Generation and Selection
- Model Selection
Helper scripts used in development.
Setup scripts.