random.RandomState with different versions of numpy has vastly different performance #2782

alexcarterkarsus · 2020-04-03T17:16:49Z

the performance of random.RandomState in word2vec.py (version 3.8.0)

def seeded_vector(self, seed_string, vector_size):
         once = random.RandomState(self.hashfxn(seed_string) & 0xffffffff)
         return (once.rand(vector_size) - 0.5) / vector_size

seemingly depends greatly on the version of numpy installed. With numpy = 1.14.3, the following code

from  numpy.random import RandomState as Ran
from time import time
t1 = time()
for i in range(100000):
    temp = Ran(hash((i)) & 0xffffffff)
t2 = time()
t2-t1

produced
0.28105926513671875
exactly the same code with numpy= 1.18.1 produced

18.590345859527588

I noticed this because I was training a model with millions of words as vocabulary, and after updating numpy unwittingly (via a anaconda update), I noticed that the time for build_vocab was significantly longer, and after some debugging, I nailed it down to random.RandomState in the seeded_vector function.
I know this is indeed a numpy issue, but even they mentioned it that RandomState is legacy (https://docs.scipy.org/doc/numpy/reference/random/performance.html). Therefore I wonder if you have some plans to upgrade randomstate? Thanks!

The text was updated successfully, but these errors were encountered:

piskvorky · 2020-04-03T17:30:48Z

Thanks for reporting. What was the difference in random.RandomState that you nailed it down to?

We should replace RandomState if it's deprecated. What should it be replaced with, is the change 100% backward compatible? Can you open a PR?

alexcarterkarsus · 2020-04-03T19:00:40Z

what I found is that numpy.random.RandomState defaults to the legacy random generator, i.e., the slowest, however, there are other options: PCG64, SFC64, etc. You could simply use them (after import numpy.random as random) with, e.g., random.SFC64, instead of random.RandomState.Same code as above but using SFC64, even with numpy= 1.18.1, would give
2.767242193222046
which is the fastest of the bunch, although still not as fast as in numpy=1.14.3
I don't know exactly what caused the performance change inside numpy, and I also don't know how you can easily amend the code without breaking compatibility.

gojomo · 2020-04-05T04:27:20Z

I don't think it's necessary (or reasonable for users to expect) that the same seeding initializes vectors identically across changed gensim versions. (The starting state shouldn't be particularly important for end results under realistic uses – such as multithreaded training, which introduces lots more uncontrollable randomness.)

So we should just choose whatever numpy recommends as a fast pseudorandom generator.

piskvorky · 2020-04-05T09:56:20Z

Agreed. A 60x slowdown is really worrying.

gojomo · 2020-06-29T20:49:31Z

Ultimately any other uses of RandomState in gensim – https://github.com/RaRe-Technologies/gensim/search?q=RandomState&type=Code – will face the same slowdown unless updated to match numpy best-practices.

piskvorky · 2020-06-29T21:06:25Z

@zygm0nt do you think you could take care of all the places gensim uses the old (slow) numpy RNG?

zygm0nt · 2020-06-29T21:30:42Z

Sure, I can take care of that. Should this be done in this ticket? by reopening it?

piskvorky · 2020-06-29T22:04:18Z

Yeah, let me reopen. Thanks!

SagarDollin · 2021-08-26T12:33:50Z

Hi, @zygm0nt @piskvorky ,
Any updates on this issue? I would like to contribute to a PR if it is already created. Also, please help me with the best practices I must follow during a PR. Are there any unit tests to pass before creating a PR. Or do I need to check the integrity of the code locally (in which case point me to any documentation that could help)

Thanks,
looking forward to your reply.

piskvorky · 2021-08-26T12:57:24Z

Sure! The code style and instructions for Gensim are here: https://github.com/RaRe-Technologies/gensim/wiki/Developer-page
Thanks.

SagarDollin · 2021-08-26T19:57:58Z

Hi @piskvorky @gojomo,
I'm here with some updates about the work I did today.
From the above comments, it is clear that we need to replace all instances of RandomState and RandomState objects as it is deprecated from new versions of NumPy and also because it is 60 times slower.

What can we use to replace RandomState?
Ans: Instead of using np.random.RandomState(seed) we can use np.random.default_rng(seed) . default_rng instantiates a Generator with numpy's default BitGenerator(PCG64). For more info refer this documentation.

Will making the above change introduce new bugs?
Ans: The above change could potentially introduce new bugs if we do not take care of lines of code dependent on RandomState because we are using a Generator which is a replacement for RandomState , but they are not the exact same. To avoid any bugs, I have to do modifications wherever we have a dependency on RandomState.

To deal with any dependencies, I did some digging into the code, and I found few instances that have a dependency on RandomState, which will not cause any new bug given we do few modifications. Refer below for all the details(you can skip this part):

In the file gensim/utils the function get_random_state(seed) has the line return np.random.mtrand._rand on line 90, but the returned object np.random.mtrand._rand is an instance of RandomState(refer to the screenshot below)

The simple solution to this dependency is to replace np.random.mtrand._rand with an instance of Generator i.e., np.random.deafult_rng().
There are other dependencies like in the file gensim/models/poincare.py there are instances of like these:
1. self._np_random.uniform(..)
2. self._np_random.randint(..)
3. self._np_random.choice(..)
4. self._np_random.shuffle(..)
  where self._np_random = np_random.RandomState(seed).
  Now since we are replacing RandomState , the self._np_random will be a Generator, and we have the implementation of uniform(..), choice(..) , shuffle(..) with the same parameters, except randint in the Generator. I need to replace all occurrences of randint(..) with integers(..) that basically take the same parameters as randint in our case and do the same job.

You can see more info about these methods here in the documentation.

Note: Here, in any line of code I have mentioned with np, I'm referring to the numpy package.

To Conclude
Let me know if anyone has any other ideas or concerns with the above method soon, as I will start working on the PR immediately.

piskvorky · 2021-08-26T20:07:04Z

Sounds good to me, thanks.

After the replacement, can you do a sanity check re. performance? The new code should be faster (or at least not slower) than the existing code. I don't expect the overall impact will be too large (RNG is just a small part of the ML algos), but it would still be nice to include some concrete numbers in the release notes.

SagarDollin · 2021-08-27T18:54:56Z

Hey @piskvorky,
Thanks for your comment. Sure I will include a sanity test. I needed some clarifications for the sanity testing.

The Poincare model in the gensim/models/poincare.py, which seems to be heavily relying on RandomState methods. So I will be testing this model after the change. However I might need a big dataset in order to test this. The dataset poincare_hypernyms_large might be a little small for that. Let me know if you can point me to any other bigger dataset, or should I go forward with the current dataset?
Apart from Poincare there is word2vec that has RandomState but it is not using any of the methods of the RandomState. So I'm guessing testing this would give us the same results after making changes.
Should I include the sanity test file in the PR? Or is a screenshot of the results enough?

Thanks,
Sagar B Dollin

piskvorky · 2021-08-27T18:59:33Z

Should I include the sanity test file in the PR? Or is a screenshot of the results enough?

A summary of your benchmark as part of the PR description is enough. Nothing fancy or formal – really, a rudimentary sanity check. Thanks.

This is regarding the issue piskvorky#2782 . Here are the benchmarks of before and after updating: Before Update After Update Poincare Ran 42 tests in 0.418s Ran 42 tests in 0.417s test_lda Ran 48 tests in 223.845s Ran 48 tests in 225.561s utils Ran 24 tests in 0.007s Ran 24 tests in 0.007s test_matutils Ran 18 tests in 0.071s Ran 18 tests in 0.070s word2vec Ran 79 tests in 58.149s Ran 79 tests in 57.950s I don't find a big difference in time taken. However I feel it is good to be updated along with numpy.

SagarDollin · 2021-08-30T21:15:45Z

Hi @piskvorky,
The PR is ready for review.
Things achieved in this PR:

I have resolved all the dependencies on RandomState.
The code is still backward compatible in the sense, we can load a pre-trained model that relies on RandomState instead of the Generator and still be able to run it.

Note: I had to change few test files as well, as some tests were relying on hardcoded variables, and since we are using a new Random Generator, they tended to fail. So I have updated these tests. However, I'm not an expert in all the models implemented in this repo, so a review is required for any of these changes.

Thanks.

piskvorky added help wanted performance Issue related to performance (in HW meaning) reach HIGH Affects most or all Gensim users impact MEDIUM Big annoyance for affected users labels Apr 3, 2020

zygm0nt mentioned this issue Jun 22, 2020

Replace numpy.random.RandomState with SFC64 - for speed #2864

Merged

piskvorky mentioned this issue Jun 25, 2020

KeyedVectors & *2Vec API streamlining, consistency #2698

Merged

mpenkov closed this as completed in #2864 Jun 29, 2020

piskvorky reopened this Jun 29, 2020

SagarDollin mentioned this issue Aug 28, 2021

Updated RandomState (deprecated from numpy) to default_rng (Generator) #3220

Open

piskvorky mentioned this issue Sep 13, 2021

Resolve NumPy compatibility hell #3231

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

random.RandomState with different versions of numpy has vastly different performance #2782

random.RandomState with different versions of numpy has vastly different performance #2782

alexcarterkarsus commented Apr 3, 2020

piskvorky commented Apr 3, 2020 •

edited

Loading

alexcarterkarsus commented Apr 3, 2020

gojomo commented Apr 5, 2020

piskvorky commented Apr 5, 2020 •

edited

Loading

gojomo commented Jun 29, 2020

piskvorky commented Jun 29, 2020 •

edited

Loading

zygm0nt commented Jun 29, 2020

piskvorky commented Jun 29, 2020

SagarDollin commented Aug 26, 2021

piskvorky commented Aug 26, 2021

SagarDollin commented Aug 26, 2021 •

edited

Loading

piskvorky commented Aug 26, 2021 •

edited

Loading

SagarDollin commented Aug 27, 2021 •

edited

Loading

piskvorky commented Aug 27, 2021 •

edited

Loading

SagarDollin commented Aug 30, 2021 •

edited

Loading

random.RandomState with different versions of numpy has vastly different performance #2782

random.RandomState with different versions of numpy has vastly different performance #2782

Comments

alexcarterkarsus commented Apr 3, 2020

piskvorky commented Apr 3, 2020 • edited Loading

alexcarterkarsus commented Apr 3, 2020

gojomo commented Apr 5, 2020

piskvorky commented Apr 5, 2020 • edited Loading

gojomo commented Jun 29, 2020

piskvorky commented Jun 29, 2020 • edited Loading

zygm0nt commented Jun 29, 2020

piskvorky commented Jun 29, 2020

SagarDollin commented Aug 26, 2021

piskvorky commented Aug 26, 2021

SagarDollin commented Aug 26, 2021 • edited Loading

piskvorky commented Aug 26, 2021 • edited Loading

SagarDollin commented Aug 27, 2021 • edited Loading

piskvorky commented Aug 27, 2021 • edited Loading

SagarDollin commented Aug 30, 2021 • edited Loading

piskvorky commented Apr 3, 2020 •

edited

Loading

piskvorky commented Apr 5, 2020 •

edited

Loading

piskvorky commented Jun 29, 2020 •

edited

Loading

SagarDollin commented Aug 26, 2021 •

edited

Loading

piskvorky commented Aug 26, 2021 •

edited

Loading

SagarDollin commented Aug 27, 2021 •

edited

Loading

piskvorky commented Aug 27, 2021 •

edited

Loading

SagarDollin commented Aug 30, 2021 •

edited

Loading