Open
Description
With Python 3.14, the tests fail with the following error:
gensim/test/test_corpora.py:693:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
gensim/corpora/wikicorpus.py:642: in __init__
self.dictionary = Dictionary(self.get_texts())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
gensim/corpora/dictionary.py:78: in __init__
self.add_documents(documents, prune_at=prune_at)
gensim/corpora/dictionary.py:196: in add_documents
for docno, document in enumerate(documents):
^^^^^^^^^^^^^^^^^^^^
gensim/corpora/wikicorpus.py:698: in get_texts
for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
gensim/utils.py:1382: in chunkize
worker.start()
/home/xx/miniconda3/envs/gensim-314/lib/python3.14/multiprocessing/process.py:121: in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
/home/xx/miniconda3/envs/gensim-314/lib/python3.14/multiprocessing/context.py:224: in _Popen
return _default_context.get_context().Process._Popen(process_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/home/xx/miniconda3/envs/gensim-314/lib/python3.14/multiprocessing/context.py:300: in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
/home/xx/miniconda3/envs/gensim-314/lib/python3.14/multiprocessing/popen_forkserver.py:35: in __init__
super().__init__(process_obj)
/home/xx/miniconda3/envs/gensim-314/lib/python3.14/multiprocessing/popen_fork.py:20: in __init__
self._launch(process_obj)
/home/xx/miniconda3/envs/gensim-314/lib/python3.14/multiprocessing/popen_forkserver.py:47: in _launch
reduction.dump(process_obj, buf)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
obj = <InputQueue name='InputQueue-214' parent=181788 initial daemon>, file = <_io.BytesIO object at 0x733d35538d60>, protocol = None
def dump(obj, file, protocol=None):
'''Replacement for pickle.dump() using ForkingPickler.'''
> ForkingPickler(file, protocol).dump(obj)
E TypeError: cannot pickle 'generator' object
E when serializing dict item 'corpus'
E when serializing gensim.utils.InputQueue state
E when serializing gensim.utils.InputQueue object
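As of Python 3.14, forkserver is the default method for starting a new process on *nix systems other than macOS (previously it was fork). With fork, the child inherits the parent's memory, so the process object is never pickled. With forkserver, the process object is pickled and sent to the child (see reduction.dump(process_obj, buf) in the traceback above), and generators cannot be pickled.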
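The pickling failure can be reproduced without gensim or multiprocessing. The following is a minimal sketch; make_texts and try_pickle are hypothetical names used only for illustration, and make_texts stands in for a lazily generated corpus such as the one returned by get_texts().

```python
import pickle


def make_texts():
    """Hypothetical stand-in for a lazily generated corpus."""
    yield ["human", "interface"]
    yield ["graph", "trees"]


def try_pickle(obj):
    """Return None if obj can be pickled, otherwise the error message."""
    try:
        pickle.dumps(obj)
    except TypeError as exc:
        return str(exc)
    return None


# A generator object cannot be pickled, which is what forkserver trips over.
generator_error = try_pickle(make_texts())

# The same data materialized as a list pickles without error.
list_error = try_pickle(list(make_texts()))

print(generator_error)
print(list_error)
```

The first call reports a TypeError about the generator object, matching the last frame of the traceback, while the list version pickles cleanly.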
The issue is in chunkize() in utils.py, in the code path used on *nix systems.
Converting the corpus generator to a list would fix the issue, but at the cost of holding the whole corpus in memory:
    if maxsize > 0:
        # A generator cannot be pickled, so materialize it before passing it to the worker process.
        # Requires `import inspect` at module level.
        if inspect.isgenerator(corpus):
            corpus = list(corpus)
        q = multiprocessing.Queue(maxsize=maxsize)
        worker = InputQueue(q, corpus, chunksize, maxsize=maxsize, as_numpy=as_numpy)
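To make the behavior of the proposed check concrete, here is a small standalone sketch of the same conversion step. The helper materialize_if_generator and the sample data are hypothetical and not part of gensim; the sketch only shows which inputs inspect.isgenerator() catches.

```python
import inspect


def materialize_if_generator(corpus):
    """Return a list if corpus is a generator object, else return it unchanged."""
    if inspect.isgenerator(corpus):
        return list(corpus)
    return corpus


def stream_texts():
    """Hypothetical lazily generated corpus."""
    yield ["human", "interface"]
    yield ["graph", "trees"]


# A generator is consumed and stored fully in memory.
from_generator = materialize_if_generator(stream_texts())

# An existing list is passed through untouched (no copy is made).
already_list = [["a"], ["b"]]
unchanged = materialize_if_generator(already_list)

# Other iterators are not generator objects, so they are passed through as well.
plain_iterator = materialize_if_generator(iter(already_list))

print(type(from_generator).__name__)        # list
print(unchanged is already_list)            # True
print(inspect.isgenerator(plain_iterator))  # False
```

Note that the check only covers generator objects. The memory cost is paid only when a generator is passed in, but in that case the entire corpus is held in memory at once, which is the trade-off mentioned above.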