
Double-check function invocation overhead for NumPy and PyData/Sparse #19

Open · stephenchouca opened this issue Mar 15, 2021 · 5 comments

@stephenchouca (Collaborator)

At some point (if feasible), it might be a good idea to measure how much overhead NumPy and PyData/Sparse incur each time a function is invoked. That way, we can verify that any observed difference between NumPy and TACO is not just due to the overhead of interpreting Python functions or something similarly trivial. This might require modifying NumPy's and PyData/Sparse's source code to insert timers inside the functions we are benchmarking.
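As a first pass that avoids patching any source code, per-call overhead can be bounded from the outside: on a one-element array, essentially all of the runtime is validation and dispatch rather than arithmetic. A minimal sketch (the sizes and repetition counts below are arbitrary choices):

```python
import timeit
import numpy as np

tiny = np.ones(1)           # compute is negligible; timing ~ fixed per-call cost
big = np.ones(10_000_000)   # compute dominates; overhead should be invisible

n = 10_000
overhead = timeit.timeit(lambda: np.add(tiny, tiny), number=n) / n
full = timeit.timeit(lambda: np.add(big, big), number=10) / 10

print(f"approx. per-call overhead: {overhead * 1e6:.2f} us")
print(f"large-input call:          {full * 1e3:.2f} ms")
```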

@rohany (Collaborator) commented Mar 15, 2021

As per the NumPy documentation:

In NumPy, universal functions are instances of the numpy.ufunc class. Many of the built-in functions are implemented in compiled C code. The basic ufuncs operate on scalars, but there is also a generalized kind for which the basic elements are sub-arrays (vectors, matrices, etc.), and broadcasting is done over other dimensions. One can also produce custom ufunc instances using the frompyfunc factory function.
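To make the distinction concrete, here is a small sketch contrasting a built-in compiled ufunc with a custom one built via the `frompyfunc` factory, which calls back into the Python interpreter for every element (the toy operation is arbitrary):

```python
import numpy as np

x = np.arange(1_000_000, dtype=np.int64)

# Built-in ufunc: the inner loop runs in compiled C code.
y_fast = np.logical_xor(x, 1)

# Custom ufunc via frompyfunc: the interpreter is invoked once per
# element, so it carries Python overhead on top of the dispatch cost.
py_xor = np.frompyfunc(lambda a, b: bool(a) != bool(b), 2, 1)
y_slow = py_xor(x, 1)  # note: returns an object-dtype array
```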

@stephenchouca (Collaborator, Author)

When we benchmarked sparse tensor addition in TensorFlow (which also implements its kernels in compiled code) for a previous paper, we found that only ~10% of the runtime was actually spent on the addition; the rest was split between validating the inputs and dispatching the compiled code. NumPy and PyData/Sparse might not have similar issues, but that's something we might want to double-check at some point.
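One way to separate that fixed per-call cost from the actual compute, without instrumenting library internals, is to time the same call across a range of input sizes and fit a line: the intercept approximates the validation/dispatch cost. The fit is rough (larger arrays fall out of cache), but it gives an order of magnitude. A sketch, with arbitrary sizes and repetition counts:

```python
import timeit
import numpy as np

sizes = np.array([10**k for k in range(1, 7)])
times = []
for n in sizes:
    x = np.ones(n)
    reps = max(10, 1_000_000 // n)  # fewer repetitions for larger inputs
    times.append(timeit.timeit(lambda: np.add(x, x), number=reps) / reps)

# Fit time(n) ~ a + b*n: `a` is the fixed per-call cost, `b` the per-element cost.
b, a = np.polyfit(sizes, times, 1)
print(f"fixed per-call cost ~ {a * 1e6:.2f} us, per-element cost ~ {b * 1e9:.3f} ns")
```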

@rohany (Collaborator) commented Mar 16, 2021

I see. We should definitely check then.

@weiya711 (Collaborator)

I was talking to Fred about this on Thursday, and we both agreed that we definitely need to check this. It will be useful for answering questions like "Why is TACO faster than NumPy/PyData?" and "What causes the speedup, and how does each factor contribute to the speedup/slowdown of each system?" We believe that since PyData/Sparse still uses the dense NumPy calls, it spends most of its time breaking the tensor down into dense vectors for each iteration space. That code is (most likely) implemented in Python, so we need to be able to pinpoint what PyData/Sparse is doing, i.e. how much of the time is due to Python interpretation, how much is caused by the splitting/re-merging algorithm, and how much is due to the NumPy function call itself.
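A possible starting point for that breakdown is the standard-library profiler: sorting by cumulative time attributes the Python-level bookkeeping, while anything compiled by Numba shows up only as the opaque call that enters it. A sketch, assuming pydata/sparse is installed as `sparse` (the shapes and density are arbitrary):

```python
import cProfile
import pstats

import numpy as np
import sparse  # PyData/Sparse

x = sparse.random((10_000, 10_000), density=0.001)
y = sparse.random((10_000, 10_000), density=0.001)

profiler = cProfile.Profile()
profiler.enable()
np.logical_xor(x, y)  # dispatches to PyData/Sparse via __array_ufunc__
profiler.disable()

# Cumulative sort surfaces the Python-side coordinate splitting/merging;
# time spent inside Numba-compiled kernels appears only at their entry points.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(25)
```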

@rohany (Collaborator) commented Mar 16, 2021

I looked into this a little bit. Here is the flamegraph for running logical_xor (rename the file to .svg and then view it in your browser):

ufunc_stats_fg.txt

It doesn't look like there is any sort of interpretation overhead. pydata/sparse spends most of its time in a list comprehension, followed by JIT compilation and calls into generated code via Numba. I'm not very familiar with how pydata/sparse actually performs the sparse ufunc operation, so I can't attempt to match these frames up with the algorithm.
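For anyone who wants to reproduce a similar flamegraph, a minimal driver along these lines should work with py-spy (the shapes, density, and iteration count are arbitrary; py-spy and pydata/sparse installs assumed):

```python
# bench_xor.py
# Record a flamegraph with, e.g.:
#   py-spy record -o ufunc_fg.svg -- python bench_xor.py
import numpy as np
import sparse

x = sparse.random((10_000, 10_000), density=0.001)
y = sparse.random((10_000, 10_000), density=0.001)

# Repeat the op so the profile reflects steady-state execution rather
# than being dominated by Numba's one-time JIT compilation of the kernel.
for _ in range(50):
    np.logical_xor(x, y)
```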
