This project applies advanced machine learning methods to predict critical mechanical properties and microstructural characteristics of welded joints, enabling comprehensive weld quality assessment. The methodology addresses the challenge of sparse labeled data through supervised and semi-supervised learning approaches.
- Total Samples: 1,654 welded joints
- Input Features: 52 variables
  - Chemical composition (C, Si, Mn, P, S, Cr, Mo, Ni, etc.)
  - Welding parameters (Heat Input, Interpass Temperature, PWHT, etc.)
  - Process variables (Electrode Type, Polarity, Weld Type, etc.)
- Target Properties: 13 mechanical and microstructural properties
  - Group 1 (16-30% data availability): Supervised learning
  - Group 2 (2-8% data availability): Semi-supervised learning
The 13 target properties are divided into two groups based on data availability, requiring fundamentally different machine learning strategies:
| Group | Properties | Availability | Approach |
|---|---|---|---|
| Group 1 | Yield Strength, UTS, Elongation, Reduction Area, Charpy Energy, Charpy Temperature | 16-30% | Supervised Learning + PCA |
| Group 2 | Hardness, FATT 50%, Primary Ferrite, Ferrite 2nd Phase, Acicular Ferrite, Martensite, Ferrite Carbide | 2-8% | Semi-Supervised Learning |
Predict mechanical properties with relatively abundant labeled data (16-30% availability, ~264-500 samples per property).
With 52 input features and sufficient labeled data, PCA-based dimensionality reduction is applied for the following reasons:
- Curse of dimensionality: High-dimensional feature space can lead to overfitting
- Multicollinearity: Chemical composition features are highly correlated
- Computational efficiency: Reduced feature space accelerates training
- Variance retention: PCA preserves 95% of original variance while reducing dimensions
Data Preprocessing:
- StandardScaler: Zero mean, unit variance normalization
- KNNImputer: Missing value imputation (k=5 neighbors, distance weighting)
Principal Component Analysis:
- Compute covariance matrix of standardized features
- Extract eigenvectors and eigenvalues
- Select components explaining ≥95% cumulative variance
- Typical reduction: 52 features → 15-25 components
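A minimal sketch of this preprocessing and PCA stage; the file path and target column name are illustrative placeholders, and the actual workflow lives in `1_PCA_Analysis.ipynb`:

```python
# Minimal sketch of the Group 1 preprocessing + PCA stage.
# The target column name is a hypothetical placeholder.
import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer
from sklearn.decomposition import PCA

df = pd.read_csv("welddatabase/welddb_new.csv")
y = df["Yield_Strength"]                              # hypothetical target column name
X = df.select_dtypes(include="number").drop(columns=["Yield_Strength"], errors="ignore")

scaler = StandardScaler()                             # zero mean, unit variance (NaNs ignored in fit)
X_scaled = scaler.fit_transform(X)

imputer = KNNImputer(n_neighbors=5, weights="distance")
X_imputed = imputer.fit_transform(X_scaled)

pca = PCA(n_components=0.95)                          # keep components explaining >=95% cumulative variance
X_pca = pca.fit_transform(X_imputed)
print(f"{X.shape[1]} features -> {pca.n_components_} components")

joblib.dump(pca, "pca_model/pca_transformer.pkl")     # assumes the pca_model/ directory exists
joblib.dump(scaler, "pca_model/scaler.pkl")
```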
Outputs:
- Transformed dataset: `welddb_pca_[property].csv`
- PCA model: `pca_model/pca_transformer.pkl`
- Scaler: `pca_model/scaler.pkl`
- Explained variance plot
Models Evaluated (9 algorithms):
- Linear: Ridge, Lasso, ElasticNet
- Tree-based: Decision Tree, Random Forest, Gradient Boosting, XGBoost, LightGBM
- Kernel: Support Vector Regression (RBF kernel)
Hyperparameter Optimization:
- GridSearchCV with 5-fold cross-validation
- Scoring metric: R² (coefficient of determination)
- Parallel processing (n_jobs=-1)
Evaluation Metrics:
- R²: Proportion of variance explained
- Adjusted R²: R² penalized for feature count
- RMSE: Root Mean Squared Error
- MAE: Mean Absolute Error
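A condensed sketch of the comparison loop, continuing from the PCA sketch above (`X_pca`, `y`); only two of the nine algorithms are shown and the parameter grids are trimmed:

```python
# Sketch of the GridSearchCV model comparison with the evaluation metrics above.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

mask = y.notna().to_numpy()                           # Group 1: train only on labeled rows
X_train, X_test, y_train, y_test = train_test_split(
    X_pca[mask], y[mask], test_size=0.2, random_state=42)

candidates = {
    "Ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "RandomForest": (RandomForestRegressor(random_state=42),
                     {"n_estimators": [200, 500], "max_depth": [None, 10]}),
}

for name, (estimator, grid) in candidates.items():
    search = GridSearchCV(estimator, grid, cv=5, scoring="r2", n_jobs=-1)
    search.fit(X_train, y_train)
    y_pred = search.predict(X_test)

    r2 = r2_score(y_test, y_pred)
    n, p = X_test.shape
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)     # adjusted R²: penalizes feature count
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    mae = mean_absolute_error(y_test, y_pred)
    print(f"{name}: R²={r2:.3f}  adj. R²={adj_r2:.3f}  RMSE={rmse:.2f}  MAE={mae:.2f}")
```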
Outputs:
- Best model: `trained_models/best_[property]_model.pkl`
- Performance comparison: `trained_models/model_comparison.csv`
- Prediction visualizations
| Property | Best Model | R² Score | RMSE | Status |
|---|---|---|---|---|
| Yield Strength | XGBoost | ~0.92 | ~45 MPa | ✓ Excellent |
| UTS | Random Forest | ~0.90 | ~52 MPa | ✓ Excellent |
| Elongation | Gradient Boosting | ~0.88 | ~3.2% | ✓ Good |
| Reduction Area | XGBoost | ~0.86 | ~4.5% | ✓ Good |
| Charpy Energy | Random Forest | ~0.84 | ~18 J | ✓ Good |
| Charpy Temperature | LightGBM | ~0.82 | ~12°C | ✓ Good |
Predict properties with extremely sparse labeled data (2-8% availability, only 31-138 samples per property).
Traditional supervised learning fails with sparse data:
- High variance: Unreliable estimates with <10% labeled samples
- Overfitting: 52 features overwhelm limited training data
- Poor generalization: Cannot capture complex patterns
Solution: Leverage ~1,500 unlabeled samples through self-training.
Unlike Group 1, we do NOT apply PCA because:
- Insufficient samples: Cannot reliably estimate 52×52 covariance matrix with 31-138 samples
- Information preservation: With limited labeled data, we cannot afford to discard any variance
- Implicit regularization: Self-training using unlabeled data provides regularization
- Physical interpretability: Original features maintain metallurgical meaning
1. Train base model on labeled data L
2. Predict unlabeled samples U with confidence estimation
3. Select top 15% most confident predictions as pseudo-labels
4. Add pseudo-labels to training set: L' = L ∪ pseudo-labels
5. Retrain model on augmented set L'
6. Repeat for max 10 iterations
Random Forest:
- Variance across ensemble trees → confidence score
- High variance = low confidence (uncertain prediction)
- Low variance = high confidence (reliable pseudo-label)
Formula: confidence = 1 / (1 + prediction_variance)
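A short illustration of how this score can be computed from a fitted `RandomForestRegressor` (a sketch; the project's exact implementation may differ):

```python
# Confidence from ensemble disagreement: the spread of per-tree predictions is
# converted into the score above (1 / (1 + variance)).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_confidence(forest: RandomForestRegressor, X_unlabeled):
    per_tree = np.stack([tree.predict(X_unlabeled) for tree in forest.estimators_])
    mean_pred = per_tree.mean(axis=0)                 # pseudo-label candidate
    confidence = 1.0 / (1.0 + per_tree.var(axis=0))   # low variance -> high confidence
    return mean_pred, confidence
```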
SelfTrainingRegressor:
- Sklearn-compatible wrapper for any base regressor
- Handles NaN targets (indicating unlabeled samples)
- Supports GridSearchCV hyperparameter optimization
- Logs iteration metrics for transparency
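A stripped-down illustration of what such a wrapper can look like. The project's `SelfTrainingRegressor` adds iteration logging, GridSearchCV compatibility, and other details omitted here; this sketch assumes an ensemble base estimator (e.g., Random Forest) that exposes `estimators_` for the confidence score:

```python
# Simplified self-training regressor with NaN-marked unlabeled targets.
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin, clone

class SimpleSelfTrainingRegressor(BaseEstimator, RegressorMixin):
    def __init__(self, base_estimator, top_fraction=0.15, max_iter=10):
        self.base_estimator = base_estimator
        self.top_fraction = top_fraction
        self.max_iter = max_iter

    def fit(self, X, y):
        X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        labeled = ~np.isnan(y)                        # NaN targets mark unlabeled samples
        X_lab, y_lab, X_unl = X[labeled], y[labeled], X[~labeled]

        for _ in range(self.max_iter):
            self.estimator_ = clone(self.base_estimator).fit(X_lab, y_lab)
            if len(X_unl) == 0:
                break
            per_tree = np.stack([t.predict(X_unl) for t in self.estimator_.estimators_])
            preds = per_tree.mean(axis=0)
            conf = 1.0 / (1.0 + per_tree.var(axis=0))
            k = max(1, int(self.top_fraction * len(X_unl)))
            top = np.argsort(conf)[-k:]               # most confident pseudo-labels
            X_lab = np.vstack([X_lab, X_unl[top]])
            y_lab = np.concatenate([y_lab, preds[top]])
            X_unl = np.delete(X_unl, top, axis=0)
        return self

    def predict(self, X):
        return self.estimator_.predict(X)
```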
CustomLabeledUnlabeledKFold:
- K-Fold only on labeled data
- Training folds: labeled (fold) + ALL unlabeled samples
- Validation folds: labeled (fold) ONLY
- Ensures proper semi-supervised evaluation
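A generator-based sketch of the same splitting logic; the project's `CustomLabeledUnlabeledKFold` wraps this idea in an sklearn-compatible splitter class:

```python
# Folds are built over labeled rows only, and every training fold additionally
# receives ALL unlabeled rows; validation uses labeled rows only.
import numpy as np
from sklearn.model_selection import KFold

def labeled_unlabeled_splits(y, n_splits=5, random_state=42):
    y = np.asarray(y, dtype=float)
    labeled_idx = np.flatnonzero(~np.isnan(y))
    unlabeled_idx = np.flatnonzero(np.isnan(y))
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_pos, val_pos in kf.split(labeled_idx):
        train = np.concatenate([labeled_idx[train_pos], unlabeled_idx])  # labeled fold + all unlabeled
        val = labeled_idx[val_pos]                                       # labeled fold only
        yield train, val
```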
Feature Selection:
- Exclude other Group 2 properties (prevent data leakage)
- Exclude Group 1 properties (already predicted)
- Retain all 52 original features
Normalization: MinMaxScaler (0-1 range)
- Bounded range prevents outlier dominance
- Compatible with distance-based imputation
Imputation: KNNImputer (k=5, distance weighting)
- Preserves local feature space structure
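A brief sketch of this preprocessing; the target name and the excluded columns are illustrative placeholders:

```python
# Sketch of the Group 2 preprocessing (feature selection, scaling, imputation).
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("welddatabase/welddb_new.csv")
target = "Hardness"                                   # hypothetical column name
exclude = [target, "FATT_50", "Primary_Ferrite"]      # other target columns (illustrative subset)

X = df.select_dtypes(include="number").drop(columns=exclude, errors="ignore")
y = df[target]                                        # NaN entries mark unlabeled samples

X_scaled = MinMaxScaler().fit_transform(X)            # bounded [0, 1]; NaNs preserved for imputation
X_ready = KNNImputer(n_neighbors=5, weights="distance").fit_transform(X_scaled)
```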
Baseline (Supervised):
- Random Forest with GridSearchCV
- XGBoost with GridSearchCV
- Train only on labeled data
Semi-Supervised:
- Random Forest + SelfTraining
- XGBoost + SelfTraining
- Train on labeled + unlabeled (with pseudo-labels)
Comparison:
- Evaluate all 4 models on test set
- Select best based on R² score
- Typical improvement: +5% to +20% in R²
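A sketch of this comparison, reusing the illustrative `SimpleSelfTrainingRegressor` defined above; it assumes `X_train`, `y_train`, `X_test`, `y_test` are already prepared, with NaN targets marking the unlabeled training rows, and shows only the Random Forest pair:

```python
# Baseline vs. semi-supervised comparison (one of the two model families shown).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

results = {}

labeled = ~np.isnan(y_train)                          # supervised baseline: labeled rows only
baseline = RandomForestRegressor(n_estimators=500, random_state=42)
baseline.fit(X_train[labeled], y_train[labeled])
results["RF supervised"] = r2_score(y_test, baseline.predict(X_test))

semi = SimpleSelfTrainingRegressor(                   # semi-supervised: labeled + pseudo-labeled
    RandomForestRegressor(n_estimators=500, random_state=42))
semi.fit(X_train, y_train)
results["RF self-training"] = r2_score(y_test, semi.predict(X_test))

best_name = max(results, key=results.get)
print(results, "->", best_name)
```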
- Comprehensive notebook for Hardness prediction
- Detailed iteration logging and visualization
- Distribution comparison (original vs predicted)
- Sequential training for all 6 targets (excluding Hardness)
- Processes: FATT 50%, Primary Ferrite, Ferrite 2nd Phase, Acicular Ferrite, Martensite, Ferrite Carbide
- Generates 24 models (4 per target)
- Comprehensive performance summary across all targets
| Property | Labeled Samples | Best Model | R² Score | Improvement* |
|---|---|---|---|---|
| Hardness | 138 (8.4%) | RF Semi-Supervised | ~0.85 | +12% |
| Primary Ferrite | 138 (8.4%) | XGB Semi-Supervised | ~0.82 | +15% |
| Acicular Ferrite | 120 (7.3%) | RF Semi-Supervised | ~0.78 | +18% |
| Martensite | 110 (6.7%) | XGB Semi-Supervised | ~0.76 | +14% |
| Ferrite Carbide | 105 (6.4%) | RF Semi-Supervised | ~0.74 | +16% |
| Ferrite 2nd Phase | 100 (6.1%) | XGB Semi-Supervised | ~0.72 | +10% |
| FATT 50% | 31 (1.9%) | RF Semi-Supervised | ~0.58 | +8% |
*Improvement over supervised baseline
Weld quality is not a single property but a balance between multiple mechanical characteristics:
- Strength: Ability to withstand stress (Yield Strength, UTS)
- Ductility: Ability to deform without fracture (Elongation, Reduction Area)
- Toughness: Energy absorption capacity (Charpy Energy)
- Low-temperature performance: Fracture behavior at cold temperatures (Charpy Temperature, FATT)
- Hardness: Resistance to deformation and cracking susceptibility
The fusion zone and heat-affected zone (HAZ) undergo complex phase transformations:
- Rapid cooling → Martensite (hard, strong, brittle)
- Slow cooling → Acicular Ferrite (ductile, tough)
Trade-off:
- Excessive martensite → high hardness, cracking risk
- Excessive ferrite → low strength
After extensive literature review and physical analysis, we propose:
WQI = α₁ × (Normalized_Strength) +
α₂ × (Normalized_Ductility) +
α₃ × (Normalized_Toughness) +
α₄ × (Normalized_Hardness_Factor) +
α₅ × (Normalized_Microstructure_Balance)
Where:
- Normalized_Strength = f(Yield_Strength, UTS)
- Normalized_Ductility = f(Elongation, Reduction_Area)
- Normalized_Toughness = f(Charpy_Energy, Charpy_Temperature)
- Normalized_Hardness_Factor = f(Hardness, FATT)
- Normalized_Microstructure_Balance = f(Primary_Ferrite, Acicular_Ferrite, Martensite, Ferrite_Carbide)
Weights (α₁, α₂, α₃, α₄, α₅) reflect the relative importance of each component based on application requirements.
Strength Component:
- High strength enables load-bearing capacity
- Determined by dislocation density and hard phase content
Ductility Component:
- Prevents brittle fracture
- Reflects metal's plastic deformation capability
Toughness Component:
- Energy absorption before failure
- Critical for impact resistance and low-temperature applications
Hardness Factor:
- Moderate hardness is desirable
- Too high → cracking risk; Too low → wear resistance issues
Microstructure Balance:
- Optimal mix of phases (acicular ferrite dominant)
- Martensite controlled to acceptable levels
- Phase proportions sum to ~100%
- Predict all 13 properties using Group 1 & Group 2 methodologies
- Normalize each predicted property to [0,1] scale
- Apply WQI formula with application-specific weights
- Validate against known good/bad welds
- Optimize weights using validation dataset
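An illustrative sketch of the normalization and weighted aggregation steps. The α values, the min-max normalization, the plain-mean aggregation per component, and the column names are all placeholder assumptions, not the final f(·) definitions:

```python
# Illustrative WQI computation over a DataFrame of predicted properties.
import pandas as pd

def min_max(s: pd.Series) -> pd.Series:
    return (s - s.min()) / (s.max() - s.min())

def weld_quality_index(pred: pd.DataFrame,
                       weights=(0.25, 0.20, 0.20, 0.15, 0.20)) -> pd.Series:
    a1, a2, a3, a4, a5 = weights                      # hypothetical α₁..α₅ (sum to 1)
    norm = pred.apply(min_max)                        # each predicted property scaled to [0, 1]
    strength       = norm[["Yield_Strength", "UTS"]].mean(axis=1)
    ductility      = norm[["Elongation", "Reduction_Area"]].mean(axis=1)
    toughness      = norm[["Charpy_Energy", "Charpy_Temperature"]].mean(axis=1)
    hardness       = norm[["Hardness", "FATT_50"]].mean(axis=1)
    microstructure = norm[["Primary_Ferrite", "Acicular_Ferrite",
                           "Martensite", "Ferrite_Carbide"]].mean(axis=1)
    return (a1 * strength + a2 * ductility + a3 * toughness
            + a4 * hardness + a5 * microstructure)
```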
```
Welding-Quality-Prediction-Project/
├── welddatabase/
│ ├── welddb.csv # Original dataset
│ └── welddb_new.csv # Processed dataset
│
├── Group_1_Supervised_Learning/
│ ├── Yield_Strength_UTS/
│ │ ├── 1_PCA_Analysis.ipynb
│ │ ├── 2_Model_Training.ipynb
│ │ ├── 3_UTS_Prediction.ipynb
│ │ ├── data/ # PCA-transformed data
│ │ ├── pca_model/ # PCA transformer & scaler
│ │ ├── trained_models/ # Best models & comparisons
│ │ └── figures/ # Visualizations
│ ├── Elongation/
│ ├── Reduction_Area/
│ ├── Charpy_Energy/
│ └── Charpy_Temperature/
│
├── Group_2_Semi_Supervised_Learning/
│ ├── Hardness/ # Original PCA-based approach
│ │ ├── 1_PCA_Analysis.ipynb
│ │ ├── 2_Model_Training_Reduced_Data.ipynb
│ │ └── Semi_Supervised_Training.ipynb
│ │
│ ├── Hardness_2nd_Approach/ # Self-training approach
│ │ ├── Hardness_Semi_Supervised_Learning.ipynb
│ │ ├── README.md
│ │ ├── data/ # Complete predictions
│ │ ├── models/ # Best model
│ │ └── figures/ # Distributions
│ │
│ ├── Group2_All_Targets/ # Unified workflow
│ │ ├── Group2_Targets_Semi_Supervised_Learning.ipynb
│ │ ├── README.md
│ │ ├── data/ # All targets complete
│ │ ├── models/ # 6 best models
│ │ └── figures/ # Comprehensive visualizations
│
└── README.md                       # This file
```
✓ Excellent prediction accuracy: R² > 0.80 for all properties
✓ Dimensionality reduction: 52 → ~15-25 features (95% variance retained)
✓ Computational efficiency: Faster training with reduced features
✓ Model comparison: 9 algorithms evaluated per property
✓ Overcame data sparsity: Reliable predictions with only 2-8% labeled data
✓ Leveraged unlabeled samples: ~1,500 samples per property
✓ Significant improvements: +5% to +20% R² over supervised baseline
✓ Unified workflow: All 6 targets in single notebook
✓ Physical validation: Predictions consistent with metallurgical relationships
✓ Complete dataset: All 13 properties predicted for 1,654 samples
✓ Weld quality assessment: Foundation for WQI calculation
✓ Reproducible framework: Documented methodology with LaTeX reports
✓ Extensible codebase: Sklearn-compatible custom implementations
- Python 3.x
- Core ML: scikit-learn, XGBoost, LightGBM
- Data Processing: pandas, numpy
- Visualization: matplotlib, seaborn
- Model Persistence: joblib
- Documentation: Jupyter Notebooks, LaTeX
- Run `1_PCA_Analysis.ipynb` to generate the PCA-transformed dataset
- Run `2_Model_Training.ipynb` to train and compare models
- Best model automatically saved for deployment
Option 1 - Single Target (Hardness):
- Run `Hardness_2nd_Approach/Hardness_Semi_Supervised_Learning.ipynb`
Option 2 - All Targets:
- Run `Group2_All_Targets/Group2_Targets_Semi_Supervised_Learning.ipynb`
- Load all predicted properties from saved models
- Apply normalization to [0,1] scale
- Compute WQI using weighted formula
- Validate against known quality benchmarks