This project focuses on analyzing athlete performance data using symmetry metrics and building predictive models to classify athletes into different risk categories (Low Risk, Medium Risk, High Risk).
- Data Collection:
  - Source athlete performance data with metrics such as `leftAvgForce`, `rightAvgForce`, `ImpulseSymmetry`, etc.
  - Ensure the data contains the necessary identifiers (`sbuid`, `testDateUtc`).
- Pivoting the Data:
  - Reshape the dataset with a pivot table so that each row is one athlete and test date, and each metric is a column.
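The pivot step can be sketched with pandas as follows. This is a minimal illustration, assuming the raw data arrives in long format with one row per athlete, test date, and metric; the column names `metric` and `value` and the sample values are hypothetical, not taken from the project data.

```python
import pandas as pd

# Hypothetical long-format input: one row per athlete, test date, and metric.
# The "metric" and "value" column names are assumptions for illustration.
long_df = pd.DataFrame({
    "sbuid": ["a1", "a1", "a1", "a1", "a2", "a2"],
    "testDateUtc": ["2024-01-05"] * 2 + ["2024-04-02"] * 2 + ["2024-01-07"] * 2,
    "metric": ["leftAvgForce", "rightAvgForce"] * 3,
    "value": [410.0, 395.0, 420.0, 415.0, 380.0, 300.0],
})

# One row per (athlete, test date), one column per metric.
wide_df = (
    long_df.pivot_table(
        index=["sbuid", "testDateUtc"],
        columns="metric",
        values="value",
        aggfunc="mean",  # averages repeated measurements of the same metric
    )
    .reset_index()
)
wide_df.columns.name = None  # drop the leftover "metric" axis label

print(wide_df)
```

`aggfunc="mean"` is one possible choice for repeated measurements; taking the first or maximum value may suit some metrics better.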
- Preprocessing:
  - Calculate symmetry metrics (`ForceSymmetry`, `ImpulseSymmetry`, `MaxForceSymmetry`, `TorqueSymmetry`).
  - Handle missing values, remove duplicates, and cap outliers using the IQR method.
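A rough sketch of the preprocessing step is shown below. The symmetry formula is not given in this README, so the example assumes `ForceSymmetry` is the weaker side divided by the stronger side; the project may define it differently. Dropping rows with missing force values is also just one possible way to handle missing data, and the helper names `add_force_symmetry` and `cap_outliers_iqr` are hypothetical.

```python
import numpy as np
import pandas as pd


def add_force_symmetry(df: pd.DataFrame) -> pd.DataFrame:
    """Add ForceSymmetry as weaker side / stronger side (assumed definition)."""
    out = df.copy()
    sides = out[["leftAvgForce", "rightAvgForce"]]
    out["ForceSymmetry"] = sides.min(axis=1) / sides.max(axis=1)
    return out


def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up sample with one duplicate row, one missing value, and one outlier.
raw = pd.DataFrame({
    "sbuid": ["a1", "a1", "a2", "a3", "a4", "a5"],
    "testDateUtc": ["2024-01-05", "2024-01-05", "2024-01-07",
                    "2024-01-08", "2024-01-09", "2024-01-10"],
    "leftAvgForce": [400.0, 400.0, 380.0, np.nan, 410.0, 100.0],
    "rightAvgForce": [390.0, 390.0, 370.0, 405.0, 400.0, 400.0],
})

clean = (
    raw.drop_duplicates()
       .dropna(subset=["leftAvgForce", "rightAvgForce"])
       .pipe(add_force_symmetry)
)
clean["ForceSymmetry"] = cap_outliers_iqr(clean["ForceSymmetry"])

print(clean)
```

Capping keeps the outlying athlete in the dataset while pulling the extreme value back to the lower IQR fence, which is the behavior the "cap outliers" wording suggests.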
- Threshold Definition:
  - Define thresholds for symmetry metrics based on domain knowledge or data distribution.
  - Apply dynamic buffer logic for flexible risk categorization.
  - Group symmetry metrics by risk categories (`Low Risk`, `Medium Risk`, `High Risk`).
  - Perform statistical tests (e.g., ANOVA or Kruskal-Wallis) to assess whether differences in metrics across categories are significant.
  - Document results and determine which metrics are most impactful for risk categorization.
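The categorization and hypothesis test above might look like the sketch below. The cut-off values are hypothetical, the data is randomly generated, and the category is derived from a single metric for simplicity. The dynamic buffer logic is not specified in this README, so the example uses fixed cut-offs instead. `scipy.stats.kruskal` runs the Kruskal-Wallis test.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

# Hypothetical cut-offs on a 0..1 symmetry ratio; real values would come from
# domain knowledge or the observed distribution.
HIGH_RISK_BELOW = 0.85
LOW_RISK_FROM = 0.95


def categorize(symmetry: float) -> str:
    """Map one symmetry value to a risk category using fixed cut-offs."""
    if symmetry < HIGH_RISK_BELOW:
        return "High Risk"
    if symmetry < LOW_RISK_FROM:
        return "Medium Risk"
    return "Low Risk"


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "ForceSymmetry": rng.uniform(0.7, 1.0, size=300),
    "TorqueSymmetry": rng.uniform(0.7, 1.0, size=300),
})
df["RiskCategory"] = df["ForceSymmetry"].apply(categorize)

# Kruskal-Wallis: does the metric's distribution differ across categories?
p_values = {}
for metric in ["ForceSymmetry", "TorqueSymmetry"]:
    groups = [g[metric].to_numpy() for _, g in df.groupby("RiskCategory")]
    stat, p_value = kruskal(*groups)
    p_values[metric] = p_value
    print(f"{metric}: H = {stat:.2f}, p = {p_value:.4g}")
```

In this sketch `ForceSymmetry` differs across categories by construction, because the categories were derived from it. A test is only informative for metrics that did not define the categories, which is worth keeping in mind when ranking metrics by impact.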
- Define Features and Target:
  - Features: Symmetry metrics (`ForceSymmetry`, `MaxForceSymmetry`, `TorqueSymmetry`).
  - Target: `RiskCategory` (encoded as Low = 0, Medium = 1, High = 2).
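As a small illustration, the feature matrix and encoded target could be built like this. The sample rows are made up, and the exact label strings stored in `RiskCategory` are an assumption based on the category names used in this README.

```python
import pandas as pd

FEATURES = ["ForceSymmetry", "MaxForceSymmetry", "TorqueSymmetry"]
# Assumed label strings; adjust if the dataset stores them differently.
RISK_ENCODING = {"Low Risk": 0, "Medium Risk": 1, "High Risk": 2}

# Hypothetical preprocessed rows; the values are made up for illustration.
df = pd.DataFrame({
    "ForceSymmetry": [0.98, 0.91, 0.80],
    "MaxForceSymmetry": [0.97, 0.90, 0.78],
    "TorqueSymmetry": [0.99, 0.92, 0.83],
    "RiskCategory": ["Low Risk", "Medium Risk", "High Risk"],
})

X = df[FEATURES]
y = df["RiskCategory"].map(RISK_ENCODING)

# An unmapped label would become NaN, so fail early if one slips through.
if y.isna().any():
    raise ValueError("Unexpected RiskCategory label")
y = y.astype(int)

print(X.shape, y.tolist())
```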
- Handle Class Imbalance:
  - Experiment with different techniques:
    - SMOTE Oversampling
    - SMOTEENN (Combined Oversampling and Undersampling)
    - Class Weights in Random Forest
- Train Models:
  - Build Random Forest models for each technique.
  - Compare results of the following:
    - Model 1: SMOTE Oversampling
    - Model 2: No Balancing
    - Model 3: SMOTEENN
    - Model 4: Class Weights
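A minimal sketch of the training step follows, covering only the two variants that need nothing beyond scikit-learn (Model 2 and Model 4). The data is synthetic and the hyperparameters are placeholder assumptions, not the project's settings. The SMOTE and SMOTEENN variants would fit the same classifier on a resampled training split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the symmetry features and encoded risk target.
X, y = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.7, 0.2, 0.1],
    random_state=42,
)
# Stratify so each split keeps the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "No Balancing": RandomForestClassifier(n_estimators=200, random_state=42),
    # class_weight="balanced" weights classes inversely to their frequency.
    "Class Weights": RandomForestClassifier(
        n_estimators=200, class_weight="balanced", random_state=42
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # plain accuracy
    print(f"{name}: accuracy = {scores[name]:.3f}")
```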
- Evaluate Models:
  - Metrics: Accuracy, F1-Score, Precision, Recall, Confusion Matrix.
  - Visualize confusion matrices and accuracy comparisons for each model.
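The metrics listed above are all available in `sklearn.metrics`. The sketch below evaluates a single model on synthetic data; using macro-averaged F1 is an assumption made here because it treats the three risk classes equally under imbalance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    f1_score,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the symmetry features and encoded risk target.
X, y = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, weights=[0.7, 0.2, 0.1],
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
).fit(X_train, y_train)
y_pred = model.predict(X_test)

labels = [0, 1, 2]
names = ["Low Risk", "Medium Risk", "High Risk"]

accuracy = accuracy_score(y_test, y_pred)
macro_f1 = f1_score(y_test, y_pred, average="macro")
# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred, labels=labels)

print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1:.3f}")
print(cm)
# Per-class precision, recall, and F1 in one table.
print(classification_report(
    y_test, y_pred, labels=labels, target_names=names, zero_division=0
))
```

For the visual comparison, `sklearn.metrics.ConfusionMatrixDisplay` can plot `cm` when matplotlib is installed.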
- Preprocess new datasets to calculate symmetry metrics.
- Use trained models to predict risk categories for new athletes.
- Compare predictions across all models for consistency and reliability.
- Plot risk distribution across metrics.
- Visualize confusion matrices for all models.
- Display quarterly trends for selected athletes (if temporal data is available).
- Summarize hypothesis testing results and model comparisons in charts.
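The quarterly trend view could be prepared roughly as follows, assuming `testDateUtc` parses to a timezone-naive datetime and that repeated tests within a quarter are averaged. The sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical test history; the values are made up.
tests = pd.DataFrame({
    "sbuid": ["a1", "a1", "a1", "a2", "a2"],
    "testDateUtc": pd.to_datetime([
        "2024-01-05", "2024-02-20", "2024-04-02", "2024-01-07", "2024-07-15",
    ]),
    "ForceSymmetry": [0.96, 0.94, 0.90, 0.88, 0.93],
})

# Bucket each test into its calendar quarter.
tests["quarter"] = tests["testDateUtc"].dt.to_period("Q")

# Mean symmetry per athlete and quarter: athletes as rows, quarters as columns.
quarterly = (
    tests.groupby(["sbuid", "quarter"])["ForceSymmetry"]
         .mean()
         .unstack("quarter")
)

print(quarterly)
```

Quarters in which an athlete was not tested show up as `NaN`, which makes gaps in the history visible before plotting.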
- Deploy the final models and preprocessing pipeline using a tool like Streamlit.
- Document the project with:
  - A clear README file summarizing the project.
  - Model insights and key findings.
- Provide options for future enhancements:
  - Expand to include additional symmetry metrics.
  - Test on a broader dataset with different sports.