This project evaluates the performance of the TabPFNClassifier in comparison to other popular machine learning models across multiple datasets. The evaluation focuses on metrics such as accuracy, fit time, and predict time to highlight the strengths and trade-offs of the TabPFNClassifier relative to conventional models.
The TabPFNClassifier is the classifier interface of TabPFN (Tabular Prior-data Fitted Network), a transformer pretrained on synthetic tabular tasks and designed to deliver high accuracy on small tabular datasets without per-dataset training: its fit step mainly stores and preprocesses the training data, while the heavy computation happens at prediction time. This project benchmarks its performance against traditional machine learning models to identify scenarios where it excels and areas where it faces limitations.
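For orientation, the snippet below shows the scikit-learn-style fit/predict interface exposed by the `tabpfn` package; the dataset, split, and default constructor arguments are illustrative choices for this sketch, not necessarily the exact setup used in `main.ipynb`.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

# Load a small tabular dataset and split it into train/test sets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# TabPFNClassifier follows the familiar scikit-learn estimator API.
clf = TabPFNClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```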
The primary goal of this project is to compare the performance of TabPFNClassifier with other popular models, including Logistic Regression, Gaussian Naive Bayes, k-Nearest Neighbors (k-NN), Decision Trees, and Support Vector Machines (SVM). The comparison evaluates:
- Accuracy: The ability to classify instances correctly.
- Fit Time: The time taken to train the model.
- Predict Time: The time required to make predictions (a minimal measurement sketch follows this list).
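All times in the results table are reported in milliseconds. The sketch below shows one straightforward way to collect accuracy, fit time, and predict time; the use of `time.perf_counter` and `LogisticRegression` here is illustrative and not necessarily what `main.ipynb` does.

```python
import time

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000)

# Time the training step (milliseconds).
start = time.perf_counter()
model.fit(X_train, y_train)
fit_ms = (time.perf_counter() - start) * 1000

# Time the prediction step (milliseconds).
start = time.perf_counter()
y_pred = model.predict(X_test)
predict_ms = (time.perf_counter() - start) * 1000

print(f"accuracy={accuracy_score(y_test, y_pred):.4f}, "
      f"fit={fit_ms:.3f} ms, predict={predict_ms:.3f} ms")
```

Note that several entries in the results table read 0.000000 ms; this most likely reflects the resolution of the timer used rather than a genuinely instantaneous operation.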
The datasets used for evaluation (all available from scikit-learn, as sketched after this list) are:
- Breast Cancer: Predicts whether tumors are malignant or benign.
- Iris: Classifies iris flowers into three species.
- Wine: Categorizes wine into three classes based on chemical properties.
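All three datasets ship with scikit-learn, so they can be loaded directly; the snippet below is a minimal sketch of that step.

```python
from sklearn.datasets import load_breast_cancer, load_iris, load_wine

# Each loader returns a feature matrix X and integer class labels y.
datasets = {
    "breast_cancer": load_breast_cancer(return_X_y=True),
    "iris": load_iris(return_X_y=True),
    "wine": load_wine(return_X_y=True),
}

for name, (X, y) in datasets.items():
    print(f"{name}: {X.shape[0]} samples, {X.shape[1]} features, {len(set(y))} classes")
```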
The models included in this study (see the configuration sketch after this list) are:
- TabPFNClassifier (TabPFN: Tabular Prior-data Fitted Network)
- Logistic Regression
- Gaussian Naive Bayes
- k-Nearest Neighbors (k-NN)
- Decision Tree
- Support Vector Machine (SVM) with RBF kernel
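A convenient way to run such a comparison is a name-to-estimator mapping that a single benchmark loop iterates over. The sketch below uses default hyperparameters apart from the RBF kernel noted above; the exact settings in `main.ipynb` may differ.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from tabpfn import TabPFNClassifier

# One entry per compared model; hyperparameters are illustrative defaults.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "GaussianNB": GaussianNB(),
    "k-Nearest Neighbor": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(kernel="rbf"),
    "TabPFNClassifier": TabPFNClassifier(),
}
```

Each (dataset, model) pair can then be fit and timed as sketched earlier, producing the results summarized below.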
| Dataset | Model | Accuracy | Fit Time (ms) | Predict Time (ms) |
|---|---|---|---|---|
| breast_cancer | LogisticRegression | 0.978723 | 17.718315 | 0.000000 |
| breast_cancer | GaussianNB | 0.941489 | 0.995636 | 0.000000 |
| breast_cancer | k-Nearest Neighbor | 0.957447 | 0.000000 | 9.120703 |
| breast_cancer | DecisionTree | 0.920213 | 4.174471 | 0.000000 |
| breast_cancer | SVM | 0.968085 | 2.934217 | 1.084566 |
| breast_cancer | TabPFNClassifier | 0.978723 | 0.000000 | 951.638937 |
| iris | LogisticRegression | 0.980000 | 4.997015 | 0.000000 |
| iris | GaussianNB | 0.960000 | 1.000643 | 0.000000 |
| iris | k-Nearest Neighbor | 0.980000 | 0.996351 | 1.993179 |
| iris | DecisionTree | 0.980000 | 0.000000 | 0.000000 |
| iris | SVM | 0.980000 | 0.000000 | 0.000000 |
| iris | TabPFNClassifier | 0.980000 | 0.114679 | 223.053932 |
| wine | LogisticRegression | 0.983051 | 4.358292 | 0.000000 |
| wine | GaussianNB | 1.000000 | 0.000000 | 0.996113 |
| wine | k-Nearest Neighbor | 0.966102 | 0.000000 | 2.502203 |
| wine | DecisionTree | 0.966102 | 1.044512 | 0.000000 |
| wine | SVM | 0.983051 | 0.741243 | 0.000000 |
| wine | TabPFNClassifier | 1.000000 | 0.998259 | 269.511700 |
- TabPFNClassifier Strengths:
  - Achieved the highest accuracy on all datasets, matching or exceeding other models.
  - Particularly effective on the Wine dataset, achieving perfect accuracy.
- TabPFNClassifier Limitations:
  - Its prediction time is significantly longer than that of the other models, especially on larger datasets such as Breast Cancer (~951 ms).
- Efficiency Comparison:
  - Models like GaussianNB and DecisionTree excel in computational efficiency, with minimal fit and prediction times.
  - Logistic Regression and SVM offer a good balance between accuracy and computational speed.
- Recommendations:
  - Use TabPFNClassifier for scenarios requiring top-tier accuracy, particularly when prediction speed is not a constraint.
  - For real-time applications or larger datasets, consider faster models such as GaussianNB or DecisionTree.
- Python 3.8 or later
- Required libraries: `scikit-learn`, `pandas`, `plotly`, `tabpfn`
Install dependencies using pip:
pip install scikit-learn pandas plotly tabpfn
Open the main.ipynb Jupyter Notebook and execute all cells to reproduce the analysis:
jupyter notebook main.ipynb
- Interactive Accuracy Graph: Visualizes the accuracy of models across datasets (a minimal plotting sketch follows this list).
- Interactive Time Graph: Compares fit and predict times across models.
- Results DataFrame: A tabular summary of metrics.
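For reference, a grouped bar chart of accuracy per model and dataset can be produced with plotly express along the lines below; the figure styling in `main.ipynb` may differ, and the few rows shown are taken from the results table above.

```python
import pandas as pd
import plotly.express as px

# A few rows from the results table above, for illustration.
results = pd.DataFrame({
    "Dataset": ["breast_cancer", "breast_cancer", "iris", "iris"],
    "Model": ["LogisticRegression", "TabPFNClassifier", "LogisticRegression", "TabPFNClassifier"],
    "Accuracy": [0.978723, 0.978723, 0.980000, 0.980000],
})

# Grouped bar chart: one bar per model within each dataset.
fig = px.bar(results, x="Dataset", y="Accuracy", color="Model", barmode="group",
             title="Model accuracy per dataset")
fig.show()
```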
This project is licensed under the MIT License.