A scalable architecture for the early identification of new Tor onion services and the daily analysis of their content. The solution is built on Big Data technologies (Kubernetes, Kafka, Kubeflow, and MinIO): it continuously discovers onion services from several sources (threat intelligence, code repositories, web-Tor gateways, and Tor repositories), deduplicates them using MinHash LSH, and categorizes their content with BERTopic topic modeling.
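For illustration, the deduplication step could look like the minimal sketch below, assuming the `datasketch` library; the shingling scheme, similarity threshold, and example pages are illustrative choices, not the repository's actual implementation.

```python
# A minimal sketch of MinHash LSH deduplication, assuming the datasketch library.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128    # number of permutations per MinHash signature
THRESHOLD = 0.8   # Jaccard similarity above which two pages count as duplicates

def minhash_of(text):
    """Build a MinHash signature from word-level shingles of a page."""
    sig = MinHash(num_perm=NUM_PERM)
    for shingle in set(text.split()):
        sig.update(shingle.encode("utf-8"))
    return sig

def deduplicate(pages):
    """Return the keys of pages that are not near-duplicates of earlier ones."""
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for key, text in pages.items():
        sig = minhash_of(text)
        if not lsh.query(sig):   # no sufficiently similar page seen so far
            lsh.insert(key, sig)
            kept.append(key)
    return kept

if __name__ == "__main__":
    # Hypothetical onion addresses and page texts, for illustration only.
    sample = {
        "http://exampleonionaddr1.onion": "hidden service marketplace listing page",
        "http://exampleonionaddr2.onion": "hidden service marketplace listing page",
        "http://exampleonionaddr3.onion": "completely different forum index",
    }
    print(deduplicate(sample))
```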
The repository is organized as follows:
- management/: management services for the architecture
  - local.env.example: environment variables for the management services
  - dashboard/: Kubernetes Dashboard service
  - gitlab-agent/: GitLab agent for integrating Kubernetes with GitLab CI/CD
  - gitlab-runner/: GitLab Runner for CI/CD pipelines
  - grafana/: Grafana service for monitoring
  - prometheus/: Prometheus service for monitoring
  - nfs-provisioner/: NFS provisioner service for persistent volumes
- applications/: application-level services for the dark web monitoring
  - local.env.example: environment variables for the application-level services
  - data_sources/: data sources (threat intelligence, code repositories, web-Tor gateways, and Tor repositories) configured in Crawlab
  - crawlab/: Crawlab service for data ingestion and the crawler database
  - downloaders/: service for downloading Tor HTML pages and the downloader database (see the data-flow sketch after this list)
  - torproxy/: Tor proxy for downloading Tor HTML pages
  - kafka/: Kafka service for streaming
  - mlops/: Kubeflow service for the daily batch processing
  - minio/: MinIO service for object data storage
  - jobs/: Kubernetes jobs for architecture configuration
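As referenced above, the following is a minimal sketch of the downloader data flow, assuming a SOCKS proxy exposed by the torproxy service, a broker from the kafka service, the `requests` library with SOCKS support, and `kafka-python`; the host names, ports, and topic name are hypothetical and not taken from the repository's configuration.

```python
# A minimal sketch: fetch an onion page through the Tor proxy and stream it to Kafka.
import requests
from kafka import KafkaProducer

TOR_PROXY = "socks5h://torproxy:9050"   # hypothetical in-cluster proxy address
KAFKA_BROKER = "kafka:9092"             # hypothetical in-cluster broker address
TOPIC = "onion-pages"                   # hypothetical topic name

def download_and_publish(onion_url):
    """Fetch an onion page through Tor and publish its HTML to Kafka."""
    proxies = {"http": TOR_PROXY, "https": TOR_PROXY}
    response = requests.get(onion_url, proxies=proxies, timeout=60)
    response.raise_for_status()

    producer = KafkaProducer(bootstrap_servers=KAFKA_BROKER)
    producer.send(TOPIC, key=onion_url.encode("utf-8"), value=response.content)
    producer.flush()

if __name__ == "__main__":
    download_and_publish("http://exampleonionaddress.onion/")
```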
The project requires the following dependencies:
- Python 3
- Docker 20.10.21
- Kubernetes 1.22.12 (a working Kubernetes cluster is required)
By executing the CI/CD pipeline (.gitlab-ci.yml), the architecture is deployed automatically on your Kubernetes cluster. Beforehand, the GitLab agent (management/gitlab-agent) and runners (management/gitlab-runner) must be configured; see the official documentation for the GitLab Agent and GitLab Runner.
The architecture is used through:
- Crawlab interface for the data ingestion
- Kubernetes client and dashboard for data engineering instructions and monitoring
- Kubeflow interface for configuring the daily batch data processing (a pipeline sketch follows below)
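
As an illustration of the last point, a daily batch pipeline could be declared with the Kubeflow Pipelines SDK roughly as sketched below; the component names, steps, and output path are placeholders, not the project's actual pipeline definition.

```python
# A minimal sketch of a daily batch pipeline with the Kubeflow Pipelines SDK (kfp v2).
from kfp import dsl, compiler

@dsl.component
def fetch_new_pages() -> int:
    """Placeholder: count newly downloaded Tor pages in object storage."""
    return 0

@dsl.component
def deduplicate_pages(page_count: int) -> int:
    """Placeholder: drop near-duplicate pages (e.g. with MinHash LSH)."""
    return page_count

@dsl.component
def categorize_pages(page_count: int):
    """Placeholder: assign topics to the remaining pages (e.g. with BERTopic)."""
    print(f"categorized {page_count} pages")

@dsl.pipeline(name="daily-darkweb-batch")
def daily_batch():
    fetched = fetch_new_pages()
    deduped = deduplicate_pages(page_count=fetched.output)
    categorize_pages(page_count=deduped.output)

if __name__ == "__main__":
    # Compile to a pipeline definition that can be uploaded through the Kubeflow UI.
    compiler.Compiler().compile(daily_batch, "daily_batch.yaml")
```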

