random.RandomState with different versions of numpy has vastly different performance #2782
Comments
Thanks for reporting. What was the difference in We should replace |
what I found is that |
I don't think it's necessary (or reasonable for users to expect) that the same seeding initializes vectors identically across changed gensim versions. (The starting state shouldn't be particularly important for end results under realistic uses – such as multithreaded training, which introduces lots more uncontrollable randomness.) So we should just choose whatever |
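As a rough illustration of the point above, here is a minimal sketch of a per-word seeded initializer built on numpy's newer `Generator` API (available from numpy 1.17). The helper name `seeded_vector`, the use of `zlib.crc32` as the string hash, and the scaling are assumptions made for this example, not gensim's actual code. The vectors it produces will differ from those produced by a `RandomState`-based version with the same seed, which is the trade-off being accepted here.

```python
import zlib

import numpy as np


def seeded_vector(seed_string: str, vector_size: int) -> np.ndarray:
    """Hypothetical helper: a small random vector that depends only on seed_string."""
    # Assumption: crc32 stands in for whatever stable string hash the caller prefers.
    seed = zlib.crc32(seed_string.encode("utf-8")) & 0xFFFFFFFF
    # default_rng returns a Generator, the non-legacy numpy RNG interface.
    rng = np.random.default_rng(seed)
    # Uniform values in [-0.5, 0.5), scaled down by the vector size.
    return (rng.random(vector_size) - 0.5) / vector_size


vec = seeded_vector("example_word", 100)
```

The same input string always yields the same vector within one numpy version, but nothing here tries to match the values an older RNG would have produced.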
Agreed. A 60x slowdown is really worrying. |
Ultimately any other uses of |
@zygm0nt do you think you could take care of all the places gensim uses the old (slow) numpy RNG? |
Sure, I can take care of that. Should this be done in this ticket, by reopening it? |
Yeah, let me reopen. Thanks! |
Hi, @zygm0nt @piskvorky , Thanks, |
Sure! The code style and instructions for Gensim are here: https://github.com/RaRe-Technologies/gensim/wiki/Developer-page |
Hi @piskvorky @gojomo, What can we use to replace RandomState? Will making the above change introduce new bugs? To deal with any dependencies, I did some digging into the code, and I found a few instances that have a dependency on
You can see more info about these methods here in the documentation. Note: Here, in any line of code I have mentioned with To Conclude |
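The specific methods named in the comment above did not survive in this copy of the thread. As a general illustration only, the sketch below shows some common `RandomState` methods next to their `Generator` counterparts in numpy 1.17 or later. The variable names are hypothetical, and the two APIs do not produce the same numbers for the same seed.

```python
import numpy as np

legacy = np.random.RandomState(42)   # legacy interface
rng = np.random.default_rng(42)      # Generator interface

# Uniform floats in [0, 1): rand / random_sample become random.
uniform_old = legacy.rand(3)
uniform_new = rng.random(3)
sample_old = legacy.random_sample(4)
sample_new = rng.random(4)

# Random integers in [0, 10): randint becomes integers.
ints_old = legacy.randint(0, 10, size=5)
ints_new = rng.integers(0, 10, size=5)

# Standard normal draws: randn(d0, d1) becomes standard_normal((d0, d1)).
normal_old = legacy.randn(2, 2)
normal_new = rng.standard_normal((2, 2))

# Methods such as normal, uniform, shuffle, permutation and choice
# keep the same names on both interfaces.
perm_old = legacy.permutation(5)
perm_new = rng.permutation(5)
```

Any test that compares against hardcoded random values would need updating after such a switch, since the output streams differ.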
Sounds good to me, thanks. After the replacement, can you do a sanity check re. performance? The new code should be faster (or at least not slower) than the existing code. I don't expect the overall impact will be too large (RNG is just a small part of the ML algos), but it would still be nice to include some concrete numbers in the release notes. |
Hey @piskvorky,
Thanks, |
A summary of your benchmark as part of the PR description is enough. Nothing fancy or formal – really, a rudimentary sanity check. Thanks. |
This is regarding issue piskvorky#2782. Here are the benchmarks from before and after the update:
Test suite      Tests   Before update   After update
Poincare        42      0.418s          0.417s
test_lda        48      223.845s        225.561s
utils           24      0.007s          0.007s
test_matutils   18      0.071s          0.070s
word2vec        79      58.149s         57.950s
I don't find a big difference in the time taken. However, I feel it is good to stay up to date with numpy.
Hi @piskvorky,
Note: I had to change a few test files as well, as some tests were relying on hardcoded variables, and since we are using a new Random Thanks. |
The performance of random.RandomState in word2vec.py (version 3.8.0)
seemingly depends greatly on the version of numpy installed. With numpy = 1.14.3, the following code
produced
0.28105926513671875
Exactly the same code with numpy = 1.18.1 produced
18.590345859527588
I noticed this because I was training a model with a vocabulary of millions of words, and after unwittingly updating numpy (via an Anaconda update), I noticed that build_vocab took significantly longer. After some debugging, I nailed it down to random.RandomState in the seeded_vector function. I know this is really a numpy issue, but even the numpy docs mention that RandomState is legacy (https://docs.scipy.org/doc/numpy/reference/random/performance.html). Therefore I wonder if you have any plans to move away from RandomState? Thanks!
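The timing snippet from the original report is not preserved here. The following is a minimal sketch, not the reporter's code, of how one might measure the cost of building a fresh RNG per word, which is the pattern described above. The function names and the default sizes are assumptions for illustration, and the measured times will vary by machine and numpy version.

```python
import time

import numpy as np


def legacy_vector(seed: int, size: int) -> np.ndarray:
    # Constructs a new legacy RandomState for every call, as a per-word seeding scheme would.
    return np.random.RandomState(seed).rand(size)


def generator_vector(seed: int, size: int) -> np.ndarray:
    # Same pattern using the newer Generator interface (numpy >= 1.17).
    return np.random.default_rng(seed).random(size)


def time_per_word_seeding(make_vector, n_words: int = 2000, size: int = 100) -> float:
    """Return elapsed seconds for n_words calls, each with its own seed."""
    start = time.perf_counter()
    for seed in range(n_words):
        make_vector(seed, size)
    return time.perf_counter() - start


legacy_seconds = time_per_word_seeding(legacy_vector)
generator_seconds = time_per_word_seeding(generator_vector)
print(f"numpy {np.__version__}: RandomState {legacy_seconds:.4f}s, "
      f"default_rng {generator_seconds:.4f}s")
```

Running this under two different numpy installations gives a rough picture of how much of the slowdown comes from RNG construction alone.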