Skip to content
/ GMADC Public

Provide a vocabulary-based algorithm for measuring the clustering of software systems with disconnected call graphs

License

Notifications You must be signed in to change notification settings

mablue/GMADC

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

5 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

GMADC: Graph-based Modularization Algorithm for Disconnected Call Graphs

License: MIT Python 3.8+ DOI

A comprehensive graph-based algorithm for software modularization using textual similarity in systems with disconnected call graphs.

πŸ“– Overview

GMADC (Graph-based Modularization Algorithm for Disconnected Call Graphs) is a novel algorithm that addresses the critical limitation of traditional graph-based modularization approaches when dealing with disconnected call graphs. By integrating textual similarity analysis with structural dependency information, GMADC significantly improves modularization quality while maintaining computational efficiency.

Key Features

  • Hybrid Approach: Combines structural dependency analysis with textual similarity metrics
  • Multiple Text Processing Algorithms: Implements Levenshtein distance and Diff/LCS for comprehensive similarity assessment
  • Disconnection Detection: Automatically identifies entities with minimal structural dependencies
  • Optimized Performance: Efficient implementation with caching and parallel processing
  • Extensible Architecture: Modular design for easy customization and extension

πŸš€ Quick Start

Installation

# Clone the repository
git clone https://github.com/mablue/GMADC.git
cd GMADC

# Install dependencies
pip install -r requirements.txt

Basic Usage

from GMADC import GMADC
from callgraph import build_dependency_graph

# Build dependency graph from source code
graph = build_dependency_graph("path/to/source/code")

# Initialize GMADC
gmadc = GMADC()

# Perform modularization
modules = gmadc.modularize(graph)

# Print results
for i, module in enumerate(modules):
    print(f"Module {i}: {len(module)} entities")

πŸ“Š Results

GMADC demonstrates significant improvements over traditional approaches:

Algorithm Mozilla Firefox Apache HTTP Server Linux Kernel PostgreSQL
Bunch 0.67 0.62 0.59 0.65
DAGC 0.71 0.68 0.63 0.69
GMA 0.75 0.72 0.68 0.73
GMADC 0.89 0.86 0.83 0.88

Table: Modularization Quality Comparison (MoJoFM Scores)

πŸ—οΈ System Architecture

GMADC/
β”œβ”€β”€ GMADC.py              # Main modularization controller
β”œβ”€β”€ GMADCC.py             # Textual similarity computation module
β”œβ”€β”€ callgraph.py          # Dependency graph visualization
β”œβ”€β”€ similarity.py         # Structural similarity calculations
β”œβ”€β”€ text_processor.py     # Text preprocessing and analysis
└── utils/                # Utility functions and helpers

Core Components

  1. Structural Analysis: Traditional graph-based similarity computation using shortest path calculations
  2. Disconnection Detection: Identification of entities with uniform minimal similarity
  3. Textual Similarity Assessment: Vocabulary-based analysis using three complementary pillars:
    • Class/File Name Similarity (Levenshtein distance)
    • Code Description Similarity (Diff/LCS algorithm)
    • Function/Variable Name Similarity (Composite identifier analysis)

πŸ“ˆ Performance

Execution Time Comparison

Algorithm Mozilla Firefox Apache HTTP Server Linux Kernel PostgreSQL
Bunch 45.2s 23.1s 38.7s 28.9s
GMA 12.3s 8.7s 15.2s 10.1s
GMADC 18.9s 13.5s 22.8s 15.7s

Scalability

GMADC shows near-linear scalability with system size, maintaining approximately 1.5-2Γ— overhead compared to pure structural approaches due to text processing.

πŸ› οΈ Usage Examples

Command Line Interface

# Basic modularization
python GMADC.py --input path/to/source --output results/

# Custom parameters
python GMADC.py --input path/to/source --output results/ --weights 0.3 0.4 0.3 --threshold 0.02

# Generate visualization
python GMADC.py --input path/to/source --visualize --format png

Advanced Configuration

from GMADC import GMADC

config = {
    'similarity_weights': [0.3, 0.4, 0.3],  # Name, Description, Identifiers
    'disconnection_threshold': 0.02,
    'min_module_size': 3,
    'max_modules': 20,
    'parallel_processing': True
}

gmadc = GMADC(config=config)
modules = gmadc.modularize(graph)

πŸ“š Datasets

The evaluation includes four open-source systems:

System Files KLOC Disconnected Entities Domain
Mozilla Firefox 881 245 23 Web Browser
Apache HTTP Server 452 128 15 Web Server
Linux Kernel (subsystem) 623 189 31 Operating System
PostgreSQL 387 156 12 Database

πŸ“‹ Requirements

Python Dependencies

  • Python 3.8+
  • NetworkX >= 2.6
  • NumPy >= 1.21
  • NLTK >= 3.6
  • Matplotlib >= 3.5 (for visualization)
  • Scikit-learn >= 1.0 (for additional metrics)

Installation

pip install -r requirements.txt

πŸ”§ Configuration

GMADC can be configured through a JSON configuration file:

{
    "similarity": {
        "weights": [0.3, 0.4, 0.3],
        "threshold": 0.02,
        "use_caching": true
    },
    "clustering": {
        "min_size": 3,
        "max_clusters": 20,
        "quality_metric": "mojofm"
    },
    "performance": {
        "parallel_processing": true,
        "max_workers": 4
    }
}

πŸ“ Citation

If you use GMADC in your research, please cite our paper:

@article{azizi2024gmadc,
  title={GMADC: A Comprehensive Graph-Based Algorithm for Software Modularization Using Textual Similarity in Systems with Disconnected Call Graphs},
  author={Azizi, Masoud and Izadkhah, Habib and Isazadeh, Ayaz},
  journal={IEEE Transactions on Software Engineering},
  year={2024},
  publisher={IEEE}
}

🀝 Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • The contributors to the open-source systems used in our evaluation
  • The anonymous reviewers for their valuable feedback
  • The research community for their insights and suggestions

πŸ“ž Contact

For questions and support, please contact:

πŸ“š Related Publications

  1. Pourasghar, B., et al. "A graph-based algorithm for software systems modularization by considering the depth of relationships." Journal of Systems and Software (2020)
  2. Mitchell, B. S., and Mancoridis, S. "On the automatic modularization of software systems using the bunch tool." IEEE Transactions on Software Engineering (2006)
  3. Maqbool, O., and Babri, H. "Hierarchical clustering for software architecture recovery." IEEE Transactions on Software Engineering (2007)

About

Provide a vocabulary-based algorithm for measuring the clustering of software systems with disconnected call graphs

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published