
Double-check function invocation overhead for NumPy and PyData/Sparse #19

Open · stephenchouca opened this issue Mar 15, 2021 · 5 comments

@stephenchouca (Collaborator)

At some point (if feasible), it might be a good idea to measure how much overhead NumPy and PyData/Sparse incur each time a function is invoked. That way, we can verify that any observed difference between NumPy and TACO is not just due to the overhead of interpreting Python functions or something similarly trivial. This might require modifying NumPy's and PyData/Sparse's source code to insert timers inside the functions we are benchmarking.
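As a first pass that avoids patching any source code, per-call overhead can be bounded from the outside: on a one-element array, essentially all of the runtime is validation and dispatch rather than arithmetic. A minimal sketch (the sizes and repetition counts below are arbitrary choices):

```python
import timeit
import numpy as np

tiny = np.ones(1)           # compute is negligible; timing ~ fixed per-call cost
big = np.ones(10_000_000)   # compute dominates; overhead should be invisible

n = 10_000
overhead = timeit.timeit(lambda: np.add(tiny, tiny), number=n) / n
full = timeit.timeit(lambda: np.add(big, big), number=10) / 10

print(f"approx. per-call overhead: {overhead * 1e6:.2f} us")
print(f"large-input call:          {full * 1e3:.2f} ms")
```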

@rohany (Collaborator) commented Mar 15, 2021

As per the NumPy documentation:

In NumPy, universal functions are instances of the numpy.ufunc class. Many of the built-in functions are implemented in compiled C code. The basic ufuncs operate on scalars, but there is also a generalized kind for which the basic elements are sub-arrays (vectors, matrices, etc.), and broadcasting is done over other dimensions. One can also produce custom ufunc instances using the frompyfunc factory function.
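To make the distinction concrete, here is a small sketch contrasting a built-in compiled ufunc with a custom one built via the `frompyfunc` factory, which calls back into the Python interpreter for every element (the toy operation is arbitrary):

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.int64)

# Built-in ufunc: the inner loop runs in compiled C code.
y_fast = np.logical_xor(x, 1)

# Custom ufunc via frompyfunc: the interpreter is invoked once per
# element, so it carries Python overhead on top of the dispatch cost.
py_xor = np.frompyfunc(lambda a, b: bool(a) != bool(b), 2, 1)
y_slow = py_xor(x, 1)  # note: returns an object-dtype array
```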

@stephenchouca (Collaborator, Author)

When we benchmarked sparse tensor addition in TensorFlow (which also implements its kernels in compiled code) for a previous paper, we found that only ~10% of the runtime was actually spent on the addition; the rest was split between validating the inputs and dispatching the compiled code. NumPy and PyData/Sparse might not have similar issues, but that's something we might want to double-check at some point.
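One way to separate that fixed per-call cost from the actual compute, without instrumenting library internals, is to time the same call across a range of input sizes and fit a line: the intercept approximates the validation/dispatch cost. The fit is rough (larger arrays fall out of cache), but it gives an order of magnitude. A sketch, with arbitrary sizes and repetition counts:

```python
import timeit
import numpy as np

sizes = np.array([10**k for k in range(1, 7)])
times = []
for n in sizes:
    x = np.ones(n)
    reps = max(10, 1_000_000 // n)  # fewer repetitions for larger inputs
    times.append(timeit.timeit(lambda: np.add(x, x), number=reps) / reps)

# Fit time(n) ~ a + b*n: `a` is the fixed per-call cost, `b` the per-element cost.
b, a = np.polyfit(sizes, times, 1)
print(f"fixed per-call cost ~ {a * 1e6:.2f} us, per-element cost ~ {b * 1e9:.3f} ns")
```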

@rohany (Collaborator) commented Mar 16, 2021

I see. We should definitely check then.

@weiya711 (Collaborator)

I was talking to Fred about this on Thursday, and we both agreed that we definitely need to check this. It will be useful for answering questions like "Why is TACO faster than NumPy/PyData?" and "What causes the speedup, and how does each factor contribute to the speedup/slowdown of each system?" We believe that since PyData/Sparse still uses the dense NumPy calls, it spends most of its time breaking the tensor down into dense vectors for each iteration space. That code is (most likely) implemented in Python, so we need to be able to pinpoint what PyData/Sparse is doing, i.e. how much of the time is due to Python interpretation, how much is caused by the splitting/re-merging algorithm, and how much is due to the NumPy function call itself.
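A possible starting point for that breakdown is the standard-library profiler: sorting by cumulative time attributes the Python-level bookkeeping, while anything compiled by Numba shows up only as the opaque call that enters it. A sketch, assuming pydata/sparse is installed as `sparse` (the shapes and density are arbitrary):

```python
import cProfile
import pstats

import numpy as np
import sparse  # PyData/Sparse

x = sparse.random((10_000, 10_000), density=0.001)
y = sparse.random((10_000, 10_000), density=0.001)

profiler = cProfile.Profile()
profiler.enable()
np.logical_xor(x, y)  # dispatches to PyData/Sparse via __array_ufunc__
profiler.disable()

# Cumulative sort surfaces the Python-side coordinate splitting/merging;
# time spent inside Numba-compiled kernels appears only at their entry points.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)
```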

@rohany (Collaborator) commented Mar 16, 2021

I looked into this a little bit. Here is the flamegraph for running logical_xor (rename the file to .svg and then view it in your browser):

ufunc_stats_fg.txt

It doesn't look like there is any sort of interpretation overhead. pydata/sparse spends most of its time in a list comprehension, followed by JIT compilation and calls into generated code via Numba. I'm not very familiar with how pydata/sparse actually performs the sparse ufunc operation, so I can't attempt to match these frames up with the algorithm.
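For anyone who wants to reproduce a similar flamegraph, a minimal driver along these lines should work with py-spy (the shapes, density, and iteration count are arbitrary; py-spy and pydata/sparse installs assumed):

```python
# bench_xor.py
# Record a flamegraph with, e.g.:
#   py-spy record -o ufunc_fg.svg -- python bench_xor.py
import numpy as np
import sparse

x = sparse.random((10_000, 10_000), density=0.001)
y = sparse.random((10_000, 10_000), density=0.001)

# Repeat the op so the profile reflects steady-state execution rather
# than being dominated by Numba's one-time JIT compilation of the kernel.
for _ in range(50):
    np.logical_xor(x, y)
```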
