Developed by DeepField ML - Gordon.H
A sophisticated football analytics platform leveraging historical match data and machine learning to provide tactical formation recommendations.
- Introduction
- Features
- Architecture Overview
- Data Architecture
- Installation
- Data Sources
- Data Processing Pipeline
- Formation Conversion Logic
- Machine Learning System
- Formation Recommendation System
- Integration Guide
- Performance Metrics
- Contributing
- License
FusionForm is a sophisticated football analytics platform designed to provide data-driven tactical insights. It analyzes historical match outcomes and the formations used by teams to build a predictive model. The system aims to recommend the most effective tactical formation for a team to adopt when facing a specific opponent formation, thereby supporting tactical decision-making.
- Historical Data Processing: Ingests and prepares game results and formation data from diverse sources.
- Dual Perspective Analysis: Transforms data to capture tactical interactions from the viewpoint of both competing teams.
- Formation Conversion: Intelligent conversion mechanisms to adapt formations between different team sizes while preserving tactical intent.
- Machine Learning Prediction: A neural network model predicts the likely outcome score of a matchup between two formations.
- Formation Recommendation Engine: Identifies the statistically optimal formation against a given opponent formation based on model predictions.
- Efficient Inference: Optimized model deployment for fast prediction generation.
- Integration Ready: Designed for seamless integration into other applications and workflows.
The system operates as a pipeline, transforming raw data through several stages into actionable tactical recommendations.
flowchart TD
A[Game Results Data] --> B[Data Preprocessing]
C[Formation Data] --> B
B --> D[Dual Perspective Transformation]
D --> E[Formation Encoding]
E --> F[Neural Network Training]
F --> G[Formation Recommendation Engine]
G --> H[Tactical Decision Support]
The foundation of FusionForm's analysis is built upon game results and detailed formation data. Processed data, including encoded formations and the trained predictive model, fuels the recommendation engine.
erDiagram
GAME_RESULTS {
string date
string match_date
string home_team
string away_team
int goals_home
int goals_away
string result
}
FORMATION_DATA {
int gameid
string team
string formation
string players
}
ENCODED_FORMATIONS {
int formation_id
string formation_pattern
vector embedding
}
MODEL {
string name
binary weights
float accuracy
}
FORMATION_RECOMMENDATION {
string opponent_formation
string recommended_formation
float predicted_outcome_score
}
GAME_RESULTS }|--|| FORMATION_DATA : contextualizes
FORMATION_DATA }|--|| ENCODED_FORMATIONS : transforms_to
ENCODED_FORMATIONS }|--|| MODEL : trains
MODEL }|--|| FORMATION_RECOMMENDATION : generates
The system utilizes game results data (match outcomes, team details) stored in a parquet file, complemented by formation data from CSV files that provide specific tactical layouts used in matches. These sources are integrated and processed to derive insights.
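Download the Unix executable for your platform (macOS or Linux) and make it runnable with chmod +x "NAME OF FILE" (lowercase x; an uppercase X does not add execute permission to a regular file that has none).
Alternatively, clone the repository and run the code and models directly.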
- Game Results: Contains fundamental match data including dates, teams, and scores. This data provides the outcome information used to train the predictive model.
- Formation Data: Provides the tactical formations employed by each team in recorded matches, linked to the game results data.
- Source: Many thanks to Mr. Schoch for providing the dataset. [https://github.com/schochastics/football-data]
The data processing pipeline transforms raw historical data into a structured and clean format suitable for machine learning.
This initial stage involves reading raw data from its stored locations – game results from a parquet file and formation data from various CSV files. The data is loaded into a usable structure, typically in-memory dataframes.
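As a rough illustration of the CSV half of this stage, the sketch below reads formation rows with Python's standard csv module and groups them by game. The column names follow the FORMATION_DATA entity in the data architecture diagram; the sample rows, the team names, and the helper name load_formations are invented for the example. Reading the Parquet file is not shown because it needs a third-party reader.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows; only the column names come from the FORMATION_DATA entity.
SAMPLE_CSV = """gameid,team,formation,players
1,Alpha FC,4-3-3,"A1;A2;A3"
1,Beta United,4-4-2,"B1;B2;B3"
2,Alpha FC,3-5-2,"A1;A2;A4"
"""


def load_formations(handle):
    """Read formation rows from an open CSV handle and group them by game id."""
    by_game = defaultdict(list)
    for row in csv.DictReader(handle):
        by_game[int(row["gameid"])].append(
            {"team": row["team"], "formation": row["formation"]}
        )
    return dict(by_game)


formations = load_formations(io.StringIO(SAMPLE_CSV))
print(formations[1])
```

Passing an open file handle instead of io.StringIO would read a real CSV file the same way.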
Ensuring data consistency is crucial. This step involves standardizing entries such as team names, removing inconsistencies or variations (like parenthetical team origin notes), and cleaning formation strings to a uniform format. Handling missing or malformed data points also occurs here.
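The cleaning rules described above could look roughly like the following. This is a minimal sketch, not the project's actual code: the function names and the exact rules (dropping any parenthetical note, requiring at least two numeric lines in a formation) are assumptions.

```python
import re


def clean_team_name(name):
    """Drop parenthetical notes such as '(ENG)' and collapse extra whitespace."""
    without_notes = re.sub(r"\s*\([^)]*\)", "", name)
    return " ".join(without_notes.split())


def clean_formation(raw):
    """Return a formation like '4-3-3', or None when the value is missing or malformed."""
    if raw is None:
        return None
    lines = re.findall(r"\d+", raw)
    # Assumption: a usable formation has at least two numeric lines.
    if len(lines) < 2:
        return None
    return "-".join(lines)


print(clean_team_name("Alpha FC (ENG)"))
print(clean_formation(" 4 - 3 - 3 "))
```

Returning None for unusable values lets a later step decide whether to drop or repair those rows.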
In this key analytical step, each match record is processed to generate two distinct data points: one from the perspective of the home team and one from that of the away team. The transformation frames the data as "Our Formation" vs. "Opponent Formation", with the "Outcome" seen from "Our" perspective. This doubles the dataset and lets the model learn formation matchups neutrally, without tying them to the home or away side.
graph LR
A[Original Match Data] --> B{Home Perspective}
A --> C{Away Perspective}
B --> D[Our Formation: 4-3-3]
B --> E[Opponent Formation: 4-4-2]
B --> F[Outcome: Win]
C --> G[Our Formation: 4-4-2]
C --> H[Opponent Formation: 4-3-3]
C --> I[Outcome: Loss]
A("Home Team: 4-3-3<br>Away Team: 4-4-2<br>Score: 2-1")
D("Our Formation: 4-3-3")
E("Opponent Formation: 4-4-2")
F("Outcome: Win")
G("Our Formation: 4-4-2")
H("Opponent Formation: 4-3-3")
I("Outcome: Loss")
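The transformation in the diagram can be sketched in a few lines. The field names home_formation and away_formation are assumptions (the formations would have to be joined onto the match record first), and the numeric outcome encoding of 1.0 for a win, 0.5 for a draw, and 0.0 for a loss is also an assumption; the source only states that an outcome is recorded from "Our" perspective.

```python
def outcome_score(goals_for, goals_against):
    """Assumed encoding: win = 1.0, draw = 0.5, loss = 0.0."""
    if goals_for > goals_against:
        return 1.0
    if goals_for == goals_against:
        return 0.5
    return 0.0


def to_dual_perspective(match):
    """Turn one match record into two rows, one per team's point of view."""
    home_row = {
        "our_formation": match["home_formation"],
        "opponent_formation": match["away_formation"],
        "outcome": outcome_score(match["goals_home"], match["goals_away"]),
    }
    away_row = {
        "our_formation": match["away_formation"],
        "opponent_formation": match["home_formation"],
        "outcome": outcome_score(match["goals_away"], match["goals_home"]),
    }
    return [home_row, away_row]


# The example from the diagram: home 4-3-3 beats away 4-4-2 by 2-1.
rows = to_dual_perspective(
    {"home_formation": "4-3-3", "away_formation": "4-4-2", "goals_home": 2, "goals_away": 1}
)
print(rows)
```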
A distinctive capability of the system is analyzing tactical setups across different team sizes (e.g., 11-a-side, 8-a-side, 7-a-side). It does so through rule-based conversion processes designed to maintain the tactical essence of the original formation.
This logic condenses an 11-player formation into a 7-player structure. It employs a prioritized rule set focusing on retaining the central defensive core, midfield control, and attacking presence while adapting width and balance for the smaller team size.
The expansion logic works in the opposite direction, growing an 8-player formation into an 11-player one. This involves strategically adding players, such as full-backs and wide midfielders, to enhance defensive solidity, midfield presence, and attacking options while preserving the strategic intent of the original 8-a-side formation.
graph TD
A[8-a-side Formation] --> B{Categorize Roles}
B --> C[Expand Defense]
C --> D[4 Defenders]
B --> E[Expand Midfield]
E --> F[4 Midfielders]
B --> G[Adjust Attack]
G --> H[2-3 Attackers]
D & F & H --> I[11-a-side Formation]
Integral to the conversion logic is the consideration of specific player positions and roles (e.g., central vs. wide, defensive vs. attacking responsibilities). This ensures that conversions are not merely numerical reductions or expansions but maintain the tactical balance and functional layout of the formation.
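The project's prioritized, role-aware rules are not spelled out here, so the sketch below swaps in a much simpler stand-in: proportional scaling of each line with largest-remainder rounding. It ignores individual roles entirely and only shows the general shape of a size conversion. The function name scale_formation and the convention of counting outfield players (team size minus the goalkeeper) are assumptions.

```python
def scale_formation(formation, target_outfield):
    """Rescale a formation string to a new number of outfield players.

    Each line keeps roughly its share of the team; players left over after
    rounding down go to the lines with the largest remainders, with ties
    resolved in favour of the deeper line.
    """
    lines = [int(part) for part in formation.split("-")]
    total = sum(lines)
    # Integer share and remainder for every line (exact integer arithmetic).
    quotas = [divmod(count * target_outfield, total) for count in lines]
    sizes = [share for share, _ in quotas]
    leftover = target_outfield - sum(sizes)
    # sorted() is stable, so equal remainders keep defence-first order.
    order = sorted(range(len(lines)), key=lambda i: -quotas[i][1])
    for i in order[:leftover]:
        sizes[i] += 1
    return "-".join(str(size) for size in sizes)


print(scale_formation("4-3-3", 6))   # 11-a-side down to 7-a-side: 2-2-2
print(scale_formation("3-3-1", 10))  # 8-a-side up to 11-a-side: 4-4-2
```

The second call happens to land on four defenders, four midfielders, and two attackers, the same shape as the expansion diagram. A real rule set would additionally decide which roles (for example full-backs or wide midfielders) are added or removed.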
The core of FusionForm's predictive power lies in its machine learning system, which learns from historical data to predict formation matchup outcomes.
Before being fed into the neural network, formation strings are converted into a numerical format. This process involves mapping each unique formation string to a numerical identifier and then representing these identifiers as dense or sparse vectors (e.g., using one-hot encoding).
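A dependency-free sketch of this encoding step follows. It assigns identifiers in sorted order, which is also how scikit-learn's LabelEncoder orders its classes, and then builds one-hot vectors. The helper names are hypothetical, and concatenating the two one-hot vectors is an assumption consistent with the 2*num_classes input size of the network.

```python
def build_vocabulary(formations):
    """Map each unique formation string to an integer id, in sorted order."""
    return {formation: index for index, formation in enumerate(sorted(set(formations)))}


def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at the given index."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector


def encode_matchup(ours, opponent, vocabulary):
    """Concatenate the one-hot vectors of both formations (length 2 * num_classes)."""
    num_classes = len(vocabulary)
    return one_hot(vocabulary[ours], num_classes) + one_hot(vocabulary[opponent], num_classes)


vocabulary = build_vocabulary(["4-4-2", "4-3-3", "3-5-2", "4-3-3"])
print(vocabulary)  # {'3-5-2': 0, '4-3-3': 1, '4-4-2': 2}
print(encode_matchup("4-3-3", "4-4-2", vocabulary))  # [0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
```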
The predictive model is a neural network designed to take the encoded representations of two formations (our team's and the opponent's) as input. It processes this information through layers of interconnected nodes to produce a single output value, representing the predicted outcome score for that formation matchup.
graph LR
A[Input Layer: 2*num_classes] --> B[Hidden Layer 1: 64 neurons]
B --> C[Hidden Layer 2: 32 neurons]
C --> D[Output Layer: 1 neuron]
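To make the layer sizes concrete, here is a plain-Python forward pass through a network of the same shape. It is only a sketch: the ReLU activations on the hidden layers and the random initialization are assumptions, and a real implementation would use a deep learning framework rather than nested lists.

```python
import random


def init_layer(n_in, n_out, rng):
    """Small random weights (n_out rows of n_in values) and zero biases."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, relu):
    """Apply one fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, value) for value in out] if relu else out


def build_network(num_classes, seed=0):
    """Layers sized 2*num_classes -> 64 -> 32 -> 1, as in the diagram."""
    rng = random.Random(seed)
    sizes = [2 * num_classes, 64, 32, 1]
    return [init_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]


def forward(network, x):
    """Run the input through all layers; ReLU on hidden layers only (assumed)."""
    for index, layer in enumerate(network):
        x = dense(x, layer, relu=index < len(network) - 1)
    return x[0]


def count_parameters(network):
    """Total number of weights and biases."""
    return sum(len(weights) * len(weights[0]) + len(biases) for weights, biases in network)


network = build_network(num_classes=3)
score = forward(network, [0.0, 1.0, 0.0, 0.0, 0.0, 1.0])
print(count_parameters(network))  # 2561 for three known formations
```

With three known formations the input has 6 values, giving (6*64 + 64) + (64*32 + 32) + (32*1 + 1) = 2561 parameters.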
The model is trained using the prepared and encoded historical data. The training process involves iterating through the data multiple times (epochs), feeding batches of formation matchups to the model, calculating the error between the model's prediction and the actual historical outcome, and adjusting the model's internal parameters (weights) to minimize this error using an optimization algorithm. A portion of the data is held out for validation during training.
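The loop structure described above (epochs, batches, error, weight update, held-out validation) can be sketched without any framework. To stay short, this example deliberately swaps in a single linear layer trained with plain mini-batch gradient descent; it does not reproduce the project's network or its optimizer, and the toy data, the learning rate, and all function names are invented for illustration.

```python
import random


def predict(weights, bias, x):
    return sum(w * v for w, v in zip(weights, x)) + bias


def mse(weights, bias, rows):
    """Mean squared error over (features, target) pairs."""
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in rows) / len(rows)


def train(rows, epochs=5, batch_size=64, lr=0.05, seed=0):
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    split = int(0.8 * len(rows))  # hold out 20% for validation
    train_rows, val_rows = rows[:split], rows[split:]
    weights, bias = [0.0] * len(rows[0][0]), 0.0
    history = []
    for _ in range(epochs):
        rng.shuffle(train_rows)
        for start in range(0, len(train_rows), batch_size):
            batch = train_rows[start:start + batch_size]
            grad_w, grad_b = [0.0] * len(weights), 0.0
            for x, y in batch:
                error = predict(weights, bias, x) - y
                for j, value in enumerate(x):
                    grad_w[j] += 2 * error * value / len(batch)
                grad_b += 2 * error / len(batch)
            # Step against the gradient of the batch MSE.
            weights = [w - lr * g for w, g in zip(weights, grad_w)]
            bias -= lr * grad_b
        history.append(mse(weights, bias, val_rows))  # validation loss per epoch
    return weights, bias, history


# Toy data: the first feature predicts a win (1.0), the second a loss (0.0).
toy_rows = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)] * 100
weights, bias, history = train(toy_rows)
print(history)
```

On this toy data the validation loss falls from the first epoch to the last, which is the trend one would watch for in the real training run.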
Once trained, the neural network model is exported to an efficient format like ONNX. This allows the model to be loaded and run quickly to generate predictions (inference) in deployment environments, separate from the training setup.
This system utilizes the trained machine learning model to provide tactical recommendations.
Given a specific opponent formation, the system evaluates all possible formations that the 'recommending' team could use. For each potential 'our' formation, it queries the machine learning model to predict the outcome score of that particular matchup (our formation vs. opponent formation).
The system's optimization strategy is to identify and recommend the formation that yields the highest predicted outcome score against the specified opponent formation. This exhaustive evaluation ensures that the system considers all known tactical options to find the statistically most favorable setup. The system includes checks to ensure the opponent formation provided is valid and recognized.
graph TD
A[Opponent Formation Input] --> B{Validate Format & Recognition}
B -- Invalid --> C[Report Error]
B -- Valid --> D[Encode Opponent Formation]
D --> E[Generate All Possible Formations]
E --> F{For Each Possible Formation}
F --> G[Combine with Opponent Encoded]
G --> H[Predict Outcome Score ONNX]
H --> I{Compare Score}
I -- New Best? --> J[Update Best Formation]
I -- Not Best --> F
J --> F
F -- All Evaluated --> K[Return Best Formation]
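The search in the diagram amounts to an argmax over the known formations. In the sketch below, predict_score is a placeholder for whatever wraps the ONNX model call, and the lookup table of scores is made up purely so the example runs; neither is part of the project's API.

```python
def recommend_formation(opponent, known_formations, predict_score):
    """Return (best_formation, best_score) against the given opponent formation.

    predict_score(ours, opponent) is any callable returning a number where
    higher means a better predicted outcome for "our" side.
    """
    if opponent not in known_formations:
        raise ValueError(f"Unknown opponent formation: {opponent!r}")
    best_formation, best_score = None, float("-inf")
    for candidate in known_formations:
        score = predict_score(candidate, opponent)
        if score > best_score:
            best_formation, best_score = candidate, score
    return best_formation, best_score


# Made-up scores standing in for model predictions.
TOY_SCORES = {
    ("4-3-3", "4-4-2"): 0.62,
    ("3-5-2", "4-4-2"): 0.71,
    ("4-4-2", "4-4-2"): 0.50,
}


def toy_predict(ours, opponent):
    return TOY_SCORES.get((ours, opponent), 0.0)


best = recommend_formation("4-4-2", ["4-3-3", "3-5-2", "4-4-2"], toy_predict)
print(best)  # ('3-5-2', 0.71)
```

Because the scorer is passed in as a callable, the same loop works with a lookup table in tests and with the real model in deployment.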
Integrating FusionForm's core recommendation capability involves loading the trained ONNX model and the formation encoding mapping (LabelEncoder). Once loaded, these components can be used to process an opponent formation input and retrieve the recommended formation output via a dedicated function.
Consult the relevant source files and documentation for specific API details on loading components and calling the recommendation function.
This section details the metrics used to evaluate the performance of the FormationPredictor model and the configuration used during its training.
The primary metric used to evaluate the effectiveness of the neural network model is Mean Squared Error (MSE). This metric measures the average squared difference between the predicted outcome score (representing the predicted result of a formation matchup) and the actual outcome value derived from the historical data.
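For reference, with N evaluated matchups, predicted scores written as \hat{y}_i and actual outcomes as y_i (symbols chosen here for illustration), the metric is:

```latex
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2
```

As a worked example under an assumed outcome encoding of 1 for a win, 0.5 for a draw, and 0 for a loss: predictions of 0.8, 0.4, and 0.5 against actual outcomes of 1, 0, and 0.5 give squared errors of 0.04, 0.16, and 0, so the MSE is 0.20 / 3, or about 0.0667.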
The model training process is configured with the following settings:
- Data Split: The processed dataset is split into training and testing sets, with 80% used for training the model and the remaining 20% held out for evaluating its performance after each training epoch.
- Batch Size: Training updates are performed using batches of 64 data samples.
- Number of Epochs: The model is trained for 5 full passes over the entire training dataset.
- Optimizer: The Adam optimization algorithm is used to adjust the model's parameters, with a learning rate set to 0.001.
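The settings above can be collected in one place, and the number of parameter updates they imply can be worked out from them. The class and function names below are hypothetical; rounding the split down and keeping a final partial batch are assumptions about details the list does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    train_fraction: float = 0.8    # 80% train, 20% test
    batch_size: int = 64
    epochs: int = 5
    learning_rate: float = 0.001   # used with the Adam optimizer


def updates_per_run(num_samples, config=TrainingConfig()):
    """Return (training samples, batches per epoch, total optimizer updates)."""
    train_size = int(num_samples * config.train_fraction)
    batches = math.ceil(train_size / config.batch_size)  # final partial batch kept
    return train_size, batches, batches * config.epochs


print(updates_per_run(10_000))  # (8000, 125, 625)
```

For a hypothetical dataset of 10,000 rows after the dual perspective step, that is 8,000 training rows, 125 batches per epoch, and 625 updates over 5 epochs.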
Model performance on the test dataset is evaluated after the completion of each training epoch. This involves using the model to predict outcomes for the held-out test data. The Mean Squared Error is calculated on these predictions compared to the actual test outcomes. This test loss provides an indication of the model's generalization capability and helps monitor for potential overfitting as training progresses.
The performance metrics are calculated for the FormationPredictor neural network, which has a layered structure designed to process encoded formation inputs. It includes an input layer sized to accept the combined representation of two formations, followed by two hidden layers with 64 and 32 neurons respectively, and concludes with a single output neuron that produces the predicted outcome score.
During the training process, the calculated test loss value for each epoch is printed to the console. The output format displays the current epoch number and the corresponding Mean Squared Error on the test dataset, typically formatted to four decimal places (e.g., Epoch X/5, Test Loss: Y.YYYY).
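A line in that format can be produced with an f-string. The helper name is hypothetical and the loss value is an arbitrary number chosen for the example.

```python
def format_epoch_log(epoch, total_epochs, test_loss):
    """Format one console line with the loss rounded to four decimal places."""
    return f"Epoch {epoch}/{total_epochs}, Test Loss: {test_loss:.4f}"


print(format_epoch_log(3, 5, 0.123456))  # Epoch 3/5, Test Loss: 0.1235
```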
We welcome contributions to FusionForm! Please submit pull requests or issues.
This project is licensed under the terms of the LICENSE file.