
Corpus chunking on Python 3.14 fails due to generators not being picklable #3628


Description

@julianpollmann

With Python 3.14, the tests throw an error:

gensim/test/test_corpora.py:693: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
gensim/corpora/wikicorpus.py:642: in __init__
    self.dictionary = Dictionary(self.get_texts())
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
gensim/corpora/dictionary.py:78: in __init__
    self.add_documents(documents, prune_at=prune_at)
gensim/corpora/dictionary.py:196: in add_documents
    for docno, document in enumerate(documents):
                           ^^^^^^^^^^^^^^^^^^^^
gensim/corpora/wikicorpus.py:698: in get_texts
    for group in utils.chunkize(texts, chunksize=10 * self.processes, maxsize=1):
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
gensim/utils.py:1382: in chunkize
    worker.start()
/home/xx/miniconda3/envs/gensim-314/lib/python3.14/multiprocessing/process.py:121: in start
    self._popen = self._Popen(self)
                  ^^^^^^^^^^^^^^^^^
/home/xx/miniconda3/envs/gensim-314/lib/python3.14/multiprocessing/context.py:224: in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/home/xx/miniconda3/envs/gensim-314/lib/python3.14/multiprocessing/context.py:300: in _Popen
    return Popen(process_obj)
           ^^^^^^^^^^^^^^^^^^
/home/xx/miniconda3/envs/gensim-314/lib/python3.14/multiprocessing/popen_forkserver.py:35: in __init__
    super().__init__(process_obj)
/home/xx/miniconda3/envs/gensim-314/lib/python3.14/multiprocessing/popen_fork.py:20: in __init__
    self._launch(process_obj)
/home/xx/miniconda3/envs/gensim-314/lib/python3.14/multiprocessing/popen_forkserver.py:47: in _launch
    reduction.dump(process_obj, buf)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

obj = <InputQueue name='InputQueue-214' parent=181788 initial daemon>, file = <_io.BytesIO object at 0x733d35538d60>, protocol = None

    def dump(obj, file, protocol=None):
        '''Replacement for pickle.dump() using ForkingPickler.'''
>       ForkingPickler(file, protocol).dump(obj)
E       TypeError: cannot pickle 'generator' object
E       when serializing dict item 'corpus'
E       when serializing gensim.utils.InputQueue state
E       when serializing gensim.utils.InputQueue object

With Python 3.14, forkserver is the default start method for new processes on *nix systems (previously it was fork). Under fork the child simply inherits the parent's memory, but forkserver pickles the Process object and sends it to the fork server, and generators/iterators cannot be pickled.
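A minimal standalone sketch of the underlying behavior (plain multiprocessing, not gensim code): passing a generator to a Process under the forkserver start method fails in the parent at start(), because the process object must be pickled to reach the fork server.

import multiprocessing

def consume(gen):
    # Never reached under forkserver: start() already fails in the parent.
    for item in gen:
        print(item)

if __name__ == "__main__":
    gen = (i for i in range(3))
    # Request forkserver explicitly so the demo behaves the same on any
    # recent Python version on *nix, not just 3.14.
    ctx = multiprocessing.get_context("forkserver")
    p = ctx.Process(target=consume, args=(gen,))
    p.start()  # TypeError: cannot pickle 'generator' object
    p.join()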

The issue is in chunkize() in gensim/utils.py, on the *nix code path. Converting the corpus generator to a list before handing it to the InputQueue worker would fix the error, but at the cost of loading the whole corpus into memory:

if maxsize > 0:
    # Proposed workaround: materialize the generator so that InputQueue
    # (a multiprocessing.Process) can be pickled under forkserver.
    if inspect.isgenerator(corpus):
        corpus = list(corpus)
    q = multiprocessing.Queue(maxsize=maxsize)
    worker = InputQueue(q, corpus, chunksize, maxsize=maxsize, as_numpy=as_numpy)
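A possible alternative that keeps the corpus streaming, sketched below as a standalone demo (a hypothetical direction, not a tested gensim patch): explicitly request the legacy fork start method, under which the child inherits the parent's memory and the generator is never pickled. Note that fork is *nix-only and unsafe in multi-threaded parents, which is why CPython changed the default.

import multiprocessing

def consume(gen):
    print(sum(gen))

if __name__ == "__main__":
    gen = (i * i for i in range(10))
    # "fork" copies the parent's address space into the child, so the
    # generator is inherited rather than pickled.
    ctx = multiprocessing.get_context("fork")
    p = ctx.Process(target=consume, args=(gen,))
    p.start()
    p.join()  # child prints 285

Applying this inside chunkize() would be more invasive, since InputQueue subclasses multiprocessing.Process and therefore resolves its start method through the default context; it would have to subclass the fork context's Process class instead.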
