Scikit-learn developers use the pytest command line tool to collect the list of all tests written in the scikit-learn source code, execute them and report any failure message.
You can install pytest with conda:
$ conda install pytest
A test is typically written as a Python function containing one or more assert statements, for instance:
def test_sum_of_two_integers():
    a = 40
    b = 2
    expected_result = 42
    assert a + b == expected_result
Scikit-learn comes with two types of tests:
- common tests: generic tests that every estimator should pass;
- tests specific to each scikit-learn module.
Common tests are stored in the sklearn/tests folder. You can run them using:
$ pytest sklearn/tests/
If you want to run the common tests on one specific estimator, e.g. RandomForestClassifier, you can use:
$ pytest sklearn/tests -k RandomForestClassifier
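The idea behind a common test can be sketched as one generic check applied to several estimators. The example below is a simplified illustration, not scikit-learn's own test code: check_fit_returns_self is a hypothetical helper that verifies the convention that fit returns the estimator itself, and the toy data is an assumption made for this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression


def check_fit_returns_self(estimator):
    """Generic check: calling fit should return the estimator itself."""
    rng = np.random.RandomState(0)
    X = rng.normal(size=(20, 3))
    # Alternate the labels so that both classes are always present.
    y = np.arange(20) % 2
    assert estimator.fit(X, y) is estimator


estimators = [
    RandomForestClassifier(n_estimators=5, random_state=0),
    LogisticRegression(),
]

checked_estimators = []
for estimator in estimators:
    # The same check runs unchanged on every estimator.
    check_fit_returns_self(estimator)
    checked_estimators.append(type(estimator).__name__)

print(checked_estimators)
```

The real common tests cover many more conventions than this single check, which is why they live in their own folder rather than in each module's tests.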
Each sklearn subpackage comes with a tests folder that gathers the test files specifically related to it. For instance, the folder sklearn/cluster/tests holds all the test files written specifically to test clustering algorithms. In particular, sklearn/cluster/tests/test_k_means.py is the test file for the K-Means clustering algorithm.
Use VSCode to open the source code for the RandomForestClassifier class.
- What is the path of this file?
- Can you find the folder that holds the tests for RandomForestClassifier in the VSCode file explorer?
- Can you find the test folder and list its content using the ls command?
- Locate the file test_forest.py in that folder.
Again, use the pwd command if you are lost.
To run the tests in this file:
$ pytest --verbose ./<path_to_test_folder>/test_forest.py
Edit the source code of RandomForestClassifier so that the predict method always returns 0.
Rerun the tests: what do you observe?
Alternatively, you can use the -vlx flags as follows:
$ pytest -vlx ./<path_to_test_folder>/test_forest.py
Find out about the meaning of those flags with
$ pytest --help
See also the section Useful pytest aliases and flags in the scikit-learn documentation.
We will now do an exercise to add a new test function named test_rf_regressor_prediction_range
in the test_forest.py
file.
The goal of this new test will be to check that a RandomForestRegressor
never predicts numerical values outside of the range of values observed in
the training set.
The test will use a training set generated at random by sampling from a standard Normal distribution.
For a test we can use a small dataset, for instance 100 samples and 10 features.
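Why should this property hold? Assuming the default squared-error setting, each leaf of a regression tree predicts the mean of the training targets it contains, and the forest averages the predictions of its trees. An average of values can never leave the range of those values. The sketch below illustrates this reasoning with NumPy only; toy_tree_predictions is a hypothetical stand-in that partitions the targets at random and is not scikit-learn's tree-building algorithm.

```python
import numpy as np

rng = np.random.RandomState(42)
y_train = rng.normal(size=100)


def toy_tree_predictions(y, n_leaves, n_queries, rng):
    """Toy stand-in for a regression tree (not scikit-learn's algorithm)."""
    # Randomly partition the training targets into "leaves".
    leaves = np.array_split(rng.permutation(y), n_leaves)
    # Each leaf predicts the mean of the targets it holds.
    leaf_values = np.array([leaf.mean() for leaf in leaves])
    # Send each query point to an arbitrary leaf.
    return leaf_values[rng.randint(n_leaves, size=n_queries)]


# The "forest" prediction is the average over 20 toy trees.
forest_predictions = np.mean(
    [toy_tree_predictions(y_train, 10, 5, rng) for _ in range(20)],
    axis=0,
)

print("training range:", y_train.min(), y_train.max())
print("prediction range:", forest_predictions.min(), forest_predictions.max())
```

The new test checks that the actual RandomForestRegressor implementation respects the same bound on unseen data.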
Here is a code template to get started:
def test_rf_regressor_prediction_range():
    # Create a Random Number Generator with 42 as a fixed seed.
    rng = np.random.RandomState(42)

    # TODO: define the n_samples and n_features variables

    X_train = rng.normal(size=(n_samples, n_features))
    y_train = rng.normal(size=n_samples)

    rfr = RandomForestRegressor(random_state=42)
    rfr.fit(X_train, y_train)

    # TODO: use np.min() and np.max() to measure the minimum and
    # maximum values of the training set target values y_train.

    # TODO: generate some random data X_test with the same number
    # of features.

    # TODO: compute the model predictions on the test data:
    # rfr.predict(X_test) and store the results in a variable y_preds.

    # TODO: check that all the values in y_preds lie between the minimum
    # and maximum values of y_train.
After each TODO, check that your code works as expected by running only your test function with the following command:
$ pytest -vl -k test_rf_regressor_prediction_range \
./<path_to_test_folder>/test_forest.py
Bonus exercise: try to modify the code of scikit-learn to make your new test fail by making it predict very large or very small values.
Note: do not commit those modifications with git.
It is recommended to format your code with a specific version of black:
$ conda install black==22.1.0
Then you can run:
$ black path/to/some_file.py
or to run it for a full source folder:
$ black sklearn/
$ black examples/
and so on.
It's also convenient to run black directly from within VS Code: press Ctrl+Shift+P, then type "> Format Current Document" and select the black formatter the first time you use this command.
Once the exercise is done, feel free to undo all your changes with:
$ git reset --hard
pytest can also test the code examples in the docstrings that describe the estimators directly in the source code.
Assuming the RandomForestClassifier docstring has been modified, the following command runs the examples in that docstring to check that they still produce the documented output:
$ pytest --doctest-modules sklearn/ensemble/_forest.py -k RandomForestClassifier
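To illustrate what a docstring test checks, here is a minimal sketch using Python's standard doctest module rather than pytest. The functions add and add_with_outdated_docstring and the run_docstring_examples helper are hypothetical names introduced for this example. Each line starting with >>> in a docstring is executed and its result is compared with the text written below it.

```python
import doctest


def add(a, b):
    """Return the sum of a and b.

    >>> add(40, 2)
    42
    """
    return a + b


def add_with_outdated_docstring(a, b):
    """Return the sum of a and b.

    >>> add_with_outdated_docstring(40, 2)
    43
    """
    return a + b


def run_docstring_examples(function):
    """Run the examples found in a function docstring and return the counts."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        function.__doc__,
        globs={function.__name__: function},
        name=function.__name__,
        filename=None,
        lineno=0,
    )
    runner = doctest.DocTestRunner(verbose=False)
    # Returns a TestResults tuple with "failed" and "attempted" counts.
    return runner.run(test)


ok_results = run_docstring_examples(add)
outdated_results = run_docstring_examples(add_with_outdated_docstring)
print(ok_results.failed, "failure(s) for add")
print(outdated_results.failed, "failure(s) for add_with_outdated_docstring")
```

The second function shows the typical situation: the code is correct but the documented output no longer matches, so the docstring example fails until it is updated.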