Make mini-batch TF-IDF raise an exception (#1631)
* Make the mini-batch methods unavailable for TF-IDF

There is currently no mini-batch implementation of TF-IDF.
To prevent Python from using the methods from the parent class
BagOfWords (which would give incorrect results), we add the methods to
TF-IDF and raise an error.

* Add missing parameters from VectorizerMixin

The parameters were documented in the docstring but were not in the
constructor.

* Changelog entry
e10e3 authored Nov 14, 2024
1 parent 5428c71 commit de119ab
Showing 2 changed files with 17 additions and 0 deletions.
4 changes: 4 additions & 0 deletions docs/releases/unreleased.md
@@ -14,6 +14,10 @@

- Make `drift.ADWIN` comply with the reference MOA implementation.

## feature extraction

- The mini-batch methods for `feature_extraction.TFIDF` now systematically raise an exception, as they are not implemented.

## stats

- Removed the unexported class `stats.CentralMoments`.
13 changes: 13 additions & 0 deletions river/feature_extraction/vectorize.py
@@ -451,6 +451,8 @@ def __init__(
        strip_accents=True,
        lowercase=True,
        preprocessor: typing.Callable | None = None,
        stop_words: set[str] | None = None,
        tokenizer_pattern=r"(?u)\b\w[\w\-]+\b",
        tokenizer: typing.Callable | None = None,
        ngram_range=(1, 1),
    ):
@@ -459,6 +461,8 @@ def __init__(
            strip_accents=strip_accents,
            lowercase=lowercase,
            preprocessor=preprocessor,
            stop_words=stop_words,
            tokenizer_pattern=tokenizer_pattern,
            tokenizer=tokenizer,
            ngram_range=ngram_range,
        )
@@ -489,3 +493,12 @@ def transform_one(self, x):
            norm = math.sqrt(sum(tfidf**2 for tfidf in tfidfs.values()))
            return {term: tfidf / norm for term, tfidf in tfidfs.items()}
        return tfidfs

    # Mini-batch methods should be done well™ and not just be a loop over the *_one equivalent.
    def learn_many(self, X):
        "Not available, will raise an exception."
        raise NotImplementedError

    def transform_many(self, X):
        "Not available, will raise an exception."
        raise NotImplementedError
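
The reason for these overrides can be illustrated with a minimal, self-contained sketch. The three classes below are hypothetical stand-ins and are not River's actual `BagOfWords` or `TFIDF`; the weighting is plain term frequency rather than real TF-IDF. The sketch only shows how an inherited mini-batch method can silently return the parent's raw counts, and how an explicit override turns that into a visible error.

```python
class CountsSketch:
    """Hypothetical parent: raw token counts, per sample and per batch."""

    def _count(self, text):
        counts = {}
        for token in text.split():
            counts[token] = counts.get(token, 0) + 1
        return counts

    def transform_one(self, x):
        return self._count(x)

    def transform_many(self, X):
        # The batch path has its own logic and never calls transform_one.
        return [self._count(x) for x in X]


class InheritingSketch(CountsSketch):
    """Hypothetical child that overrides only the per-sample method."""

    def transform_one(self, x):
        counts = self._count(x)
        total = sum(counts.values())
        # Stand-in weighting (term frequency), not real TF-IDF.
        return {term: n / total for term, n in counts.items()}


class GuardedSketch(InheritingSketch):
    """Same child, with the batch method explicitly disabled."""

    def transform_many(self, X):
        raise NotImplementedError("no mini-batch implementation available")


# The inherited batch method returns raw counts, not weighted values.
print(InheritingSketch().transform_many(["a a b"]))  # [{'a': 2, 'b': 1}]
```

Calling `GuardedSketch().transform_many(["a a b"])` raises `NotImplementedError` instead of returning counts that merely look plausible, which is the behaviour this commit gives the mini-batch methods of the TF-IDF class.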
