Fix: Found array with 0 sample(s) #743
Open
allenyllee wants to merge 1 commit into scikit-learn-contrib:master
Symptom: When `SVMSMOTE` is used on a dataset whose minority class has very few samples (possibly fewer than 10), it raises `ValueError: Found array with 0 sample(s) (shape=(0, 600)) while a minimum of 1 is required.`

Root cause: The line `noise_bool = self._in_danger_noise(...)` flags noise samples according to the `n_neighbors` attribute of the `kneighbors` estimator, which is equal to the `m_neighbors` attribute of the `SVMSMOTE` class. If `SVMSMOTE` is initialized with a very large `m_neighbors`, for example `SVMSMOTE(m_neighbors=1000)`, the error disappears. The neighbor search is then wide enough to include another minority-class data point, so the center data point is not treated as noise by the check `n_maj == nn_estimator.n_neighbors - 1`. When `m_neighbors` is small (the default is 10) and the minority class has very few samples, the whole minority class may be treated as noise. The returned `noise_bool` is then all true, `_safe_indexing(...)` removes all of these data points, and zero `support_vector` samples remain.

Solution: Save `support_vector` before trimming the noise data points. After trimming, check whether the number of support vectors is zero; if it is, restore the previously saved `support_vector`. This enforces that every minority data point is used as `support_vector`.
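To make the root cause concrete, here is a minimal pure-Python sketch of the noise check on toy 1-D data. The helper `noise_mask` and the data are hypothetical; they only mirror the `n_maj == nn_estimator.n_neighbors - 1` condition quoted above, under the assumption that the estimator counts the sample itself as one of its neighbors. It is not the imbalanced-learn implementation.

```python
def noise_mask(X, y, sample_idx, target_class, m_neighbors):
    """Flag a sample as noise when all of its m nearest neighbors
    (the sample itself excluded) belong to another class."""
    mask = []
    for i in sample_idx:
        # Brute-force neighbor search on 1-D data, excluding the sample itself.
        others = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: abs(X[j] - X[i]),
        )
        neighbors = others[:m_neighbors]
        n_maj = sum(1 for j in neighbors if y[j] != target_class)
        mask.append(n_maj == m_neighbors)
    return mask


# Two isolated minority points (class 1), each surrounded by majority points (class 0).
X = [0.0, 100.0] + [-3, -2, -1, 1, 2, 3] + [97, 98, 99, 101, 102, 103]
y = [1, 1] + [0] * 12
minority_idx = [0, 1]

small = noise_mask(X, y, minority_idx, target_class=1, m_neighbors=3)
large = noise_mask(X, y, minority_idx, target_class=1, m_neighbors=13)

# With a small neighborhood every minority point is flagged, so nothing survives trimming.
kept_small = [i for i, noisy in zip(minority_idx, small) if not noisy]
# With a neighborhood wide enough to reach the other minority point, both survive.
kept_large = [i for i, noisy in zip(minority_idx, large) if not noisy]
```

In this toy setup the small neighborhood leaves an empty list after trimming, which corresponds to the zero-sample array in the error message, while the large neighborhood keeps both minority points.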
Hello @allenyllee! Thanks for opening this PR. We checked the lines you've touched for PEP 8 issues, and found:
Member
You will need to fix the PEP 8 issue. I think we should also raise a warning, because we are not strictly performing the expected algorithm (although this is a corner case).
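One way the suggested warning could be combined with the fallback from the PR description is sketched below. The function name `drop_noise_with_fallback`, the warning text, and the use of plain lists are assumptions made for illustration; the actual change in the PR may look different.

```python
import warnings


def drop_noise_with_fallback(support_vectors, noise_mask):
    """Remove support vectors flagged as noise, but keep the original
    set (and warn) if trimming would leave none."""
    kept = [sv for sv, noisy in zip(support_vectors, noise_mask) if not noisy]
    if len(kept) == 0:
        # Corner case: the algorithm is no longer followed strictly,
        # so tell the user that noise removal was skipped.
        warnings.warn(
            "All support vectors were flagged as noise; "
            "keeping them so that resampling can proceed.",
            UserWarning,
        )
        return list(support_vectors)
    return kept


# Usage: every support vector is flagged, so the fallback applies and a warning is emitted.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    restored = drop_noise_with_fallback([[0.0], [100.0]], [True, True])

# Usage: only part of the set is flagged, so normal trimming applies.
trimmed = drop_noise_with_fallback([[0.0], [1.0]], [True, False])
```

Emitting a `UserWarning` rather than raising keeps the resampling usable in the corner case while still signalling that the noise-removal step was bypassed.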
Member
We will also need a non-regression test (the one you posted in the issue) and an entry in what's new, since this would impact the end user.
Reference Issue: #742