A comprehensive graph-based algorithm for software modularization using textual similarity in systems with disconnected call graphs.
GMADC (Graph-based Modularization Algorithm for Disconnected Call Graphs) is a novel algorithm that addresses the critical limitation of traditional graph-based modularization approaches when dealing with disconnected call graphs. By integrating textual similarity analysis with structural dependency information, GMADC significantly improves modularization quality while maintaining computational efficiency.
- Hybrid Approach: Combines structural dependency analysis with textual similarity metrics
- Multiple Text Processing Algorithms: Implements Levenshtein distance and Diff/LCS for comprehensive similarity assessment
- Disconnection Detection: Automatically identifies entities with minimal structural dependencies
- Optimized Performance: Efficient implementation with caching and parallel processing
- Extensible Architecture: Modular design for easy customization and extension
# Clone the repository
git clone https://github.com/mablue/GMADC.git
cd GMADC
# Install dependencies
pip install -r requirements.txtfrom GMADC import GMADC
from callgraph import build_dependency_graph
# Build dependency graph from source code
graph = build_dependency_graph("path/to/source/code")
# Initialize GMADC
gmadc = GMADC()
# Perform modularization
modules = gmadc.modularize(graph)
# Print results
for i, module in enumerate(modules):
print(f"Module {i}: {len(module)} entities")GMADC demonstrates significant improvements over traditional approaches:
| Algorithm | Mozilla Firefox | Apache HTTP Server | Linux Kernel | PostgreSQL |
|---|---|---|---|---|
| Bunch | 0.67 | 0.62 | 0.59 | 0.65 |
| DAGC | 0.71 | 0.68 | 0.63 | 0.69 |
| GMA | 0.75 | 0.72 | 0.68 | 0.73 |
| GMADC | 0.89 | 0.86 | 0.83 | 0.88 |
Table: Modularization Quality Comparison (MoJoFM Scores)
GMADC/
βββ GMADC.py # Main modularization controller
βββ GMADCC.py # Textual similarity computation module
βββ callgraph.py # Dependency graph visualization
βββ similarity.py # Structural similarity calculations
βββ text_processor.py # Text preprocessing and analysis
βββ utils/ # Utility functions and helpers
- Structural Analysis: Traditional graph-based similarity computation using shortest path calculations
- Disconnection Detection: Identification of entities with uniform minimal similarity
- Textual Similarity Assessment: Vocabulary-based analysis using three complementary pillars:
- Class/File Name Similarity (Levenshtein distance)
- Code Description Similarity (Diff/LCS algorithm)
- Function/Variable Name Similarity (Composite identifier analysis)
| Algorithm | Mozilla Firefox | Apache HTTP Server | Linux Kernel | PostgreSQL |
|---|---|---|---|---|
| Bunch | 45.2s | 23.1s | 38.7s | 28.9s |
| GMA | 12.3s | 8.7s | 15.2s | 10.1s |
| GMADC | 18.9s | 13.5s | 22.8s | 15.7s |
GMADC shows near-linear scalability with system size, maintaining approximately 1.5-2Γ overhead compared to pure structural approaches due to text processing.
# Basic modularization
python GMADC.py --input path/to/source --output results/
# Custom parameters
python GMADC.py --input path/to/source --output results/ --weights 0.3 0.4 0.3 --threshold 0.02
# Generate visualization
python GMADC.py --input path/to/source --visualize --format pngfrom GMADC import GMADC
config = {
'similarity_weights': [0.3, 0.4, 0.3], # Name, Description, Identifiers
'disconnection_threshold': 0.02,
'min_module_size': 3,
'max_modules': 20,
'parallel_processing': True
}
gmadc = GMADC(config=config)
modules = gmadc.modularize(graph)The evaluation includes four open-source systems:
| System | Files | KLOC | Disconnected Entities | Domain |
|---|---|---|---|---|
| Mozilla Firefox | 881 | 245 | 23 | Web Browser |
| Apache HTTP Server | 452 | 128 | 15 | Web Server |
| Linux Kernel (subsystem) | 623 | 189 | 31 | Operating System |
| PostgreSQL | 387 | 156 | 12 | Database |
- Python 3.8+
- NetworkX >= 2.6
- NumPy >= 1.21
- NLTK >= 3.6
- Matplotlib >= 3.5 (for visualization)
- Scikit-learn >= 1.0 (for additional metrics)
pip install -r requirements.txtGMADC can be configured through a JSON configuration file:
{
"similarity": {
"weights": [0.3, 0.4, 0.3],
"threshold": 0.02,
"use_caching": true
},
"clustering": {
"min_size": 3,
"max_clusters": 20,
"quality_metric": "mojofm"
},
"performance": {
"parallel_processing": true,
"max_workers": 4
}
}If you use GMADC in your research, please cite our paper:
@article{azizi2024gmadc,
title={GMADC: A Comprehensive Graph-Based Algorithm for Software Modularization Using Textual Similarity in Systems with Disconnected Call Graphs},
author={Azizi, Masoud and Izadkhah, Habib and Isazadeh, Ayaz},
journal={IEEE Transactions on Software Engineering},
year={2024},
publisher={IEEE}
}We welcome contributions! Please see our Contributing Guidelines for details.
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add some amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- The contributors to the open-source systems used in our evaluation
- The anonymous reviewers for their valuable feedback
- The research community for their insights and suggestions
For questions and support, please contact:
- Masoud Azizi: [email protected]
- Project Repository: https://github.com/mablue/GMADC
- Pourasghar, B., et al. "A graph-based algorithm for software systems modularization by considering the depth of relationships." Journal of Systems and Software (2020)
- Mitchell, B. S., and Mancoridis, S. "On the automatic modularization of software systems using the bunch tool." IEEE Transactions on Software Engineering (2006)
- Maqbool, O., and Babri, H. "Hierarchical clustering for software architecture recovery." IEEE Transactions on Software Engineering (2007)