
Fix julialang.org fib example for Cython #21

Open
eteq opened this issue Oct 22, 2014 · 14 comments

Comments

@eteq

eteq commented Oct 22, 2014

Right now, the Fibonacci example (visible at http://nbviewer.ipython.org/url/refreweb.phys.ethz.ch/hope/notebooks/julialang.org.ipynb ... for some reason this doesn't match this repository?) reads:

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def cython_fib(int n):
    if n<2:
        return n
    return cython_fib(n-1)+cython_fib(n-2)

This is a misuse of Cython, however. If you instead change it to

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef int cython_fib(int n):
    if n<2:
        return n
    return cython_fib(n-1)+cython_fib(n-2)

That yields a drastic speedup, making the Cython example equivalent to native code (and hope).

The point is that using "def" in tight loops is a misuse of Cython: it makes the Cython function act just like a regular Python function, rather than compiling it as a C function with an int return type, which is what makes the recursion here go fast.

@cosmo-ethz
Collaborator

@eteq Thanks for pointing this out.
I wasn't aware of this hybrid function declaration (and our target audience probably isn't either).
I also don't fully understand what it does and when it should be used, i.e. why this isn't the default behavior, since the overhead should be tiny.

This being said, I've rerun the recursive benchmarks and I indeed see a performance gain. However, for Fibonacci it's still quite far from the 20-fold gain you mentioned, and for the quicksort it's only a few percent.

@eteq
Author

eteq commented Oct 22, 2014

@cosmo-ethz - A fair question as to why it isn't the default. Maybe backwards compatibility? Or perhaps there's some subtle corner case where you don't want the hybrid behavior? Someone on cython-devel would probably know.

Hmm, I just ran it now locally on OS X 10.9 (I was using ubuntu before), and saw this:

function: hope_fib            , av. time sec:   0.00004005, min. time sec:   0.00003910, relative:       1.0
function: cython_fib2         , av. time sec:   0.00004721, min. time sec:   0.00004697, relative:       1.2
function: native_fib          , av. time sec:   0.00005007, min. time sec:   0.00004911, relative:       1.2
function: cython_fib          , av. time sec:   0.00061798, min. time sec:   0.00060487, relative:      15.4
function: fib                 , av. time sec:   0.00281692, min. time sec:   0.00275493, relative:      70.3
function: numba_fib           , av. time sec:   0.00281906, min. time sec:   0.00272179, relative:      70.4

where cython_fib2 is the one with the added cpdef. So here it's more like 12x faster, but the key point is that it's basically the same as native C (as it should be).

On qsort, I saw the same as you: not much speedup. I think this is because it has to spend a lot of time passing around numpy arrays when you use this syntax. If I were doing it for speed, I would put the recursion in a cdef function that passes just the data buffer instead of the full numpy object, and "wrap" that in a Python function. But I realize that's beyond the scope of what you're trying to demonstrate here.

A little off-topic, but some of the other examples are also somewhat misusing Cython, which is why it sometimes comes in similar to regular Python. I found most of these things using the cython -a [file] command: if you don't know it, it generates an HTML file that shows which lines are slow (usually because there's some Python object <-> C type conversion), and with that I can usually get Cython code to match native code. All that said, I totally understand that the point of hope is to avoid having to do any such modification! But sometimes you can't beat getting closer to the metal, and I think making these benchmarks as fair as possible (modulo time constraints) helps demonstrate when that makes sense to do...
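
For reference, a minimal sketch of that annotation workflow (the file name and function name here are just illustrative, not from the benchmarks):

# save the pure-"def" version as fib_plain.pyx, then run
#     cython -a fib_plain.pyx
# and open the generated fib_plain.html; the lines highlighted yellow are
# the ones that fall back to Python-object operations.
def fib_plain(int n):
    if n < 2:
        return n
    return fib_plain(n - 1) + fib_plain(n - 2)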

@cosmo-ethz
Collaborator

@eteq I've rerun the benchmark again, but I don't see the same speedup as you do. Which versions of Python and Cython are you using for your tests?

Ok, thanks. I have to look into that.

@eteq
Author

eteq commented Oct 23, 2014

@cosmo-ethz - I was using py2.7 for both the Ubuntu (~20x speedup) and OS X (~15x speedup) cases. On OS X I have Cython 0.21 (the latest), and I think Ubuntu is 0.20? (I don't have access to that machine right now). It's whatever version is in the Ubuntu 14.04 package manager, anyway.

@eteq
Author

eteq commented Oct 23, 2014

@cosmo-ethz - One possible problem I encountered while testing this: oddness occurs if you use the same function name for both versions in an IPython notebook. I think what was happening was that it compiled the cpdef version, but the calls in that version went to the def version that had previously been compiled (perhaps some weirdness with IPython's Cython caching?). So maybe just try something like this:

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
cpdef int cython_fib2(int n):
    if n<2:
        return n
    return cython_fib2(n-1)+cython_fib2(n-2)

@nparley

nparley commented Dec 4, 2014

For what it's worth, I also did not know about the hybrid function declaration, but I was aware that I would lose time by calling Python functions rather than Cython ones. Therefore I had been solving these problems by making a wrapper around the real function, for example:

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
cdef int cython_fib_c(int n):
    if n<2:
        return n
    return cython_fib_c(n-1)+cython_fib_c(n-2)

def cython_fib(n):
    return cython_fib_c(n)

Timing-wise I get:

10000 loops, best of 3: 20.6 µs per loop - cython
10000 loops, best of 3: 26.7 µs per loop - hope

@cosmo-ethz
Collaborator

@nparley That's a nice workaround.

I've just pushed a new version of the benchmarks in which we adapted the Cython code.

@nparley

nparley commented Dec 4, 2014

@cosmo-ethz I am not sure if you edited the sort as well, but this is my quick Cython edit (pretty much what @eteq suggested above):

%%cython

cimport cython
import numpy as np
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void cython_qsort_kernel_c(double [::1] a, int lo, int hi):
    cdef int i = lo
    cdef int j = hi
    cdef double pivot = 0.0
    cdef double tmp = 0.0
    while i < hi:
        pivot = a[(lo+hi) // 2]
        while i <= j:
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                tmp = a[i]
                a[i] = a[j]
                a[j] = tmp
                i += 1
                j -= 1
        if lo < j:
            cython_qsort_kernel_c(a, lo, j)
        lo = i
        j = hi
    return

def cython_qsort_kernel(double [::1] a, int lo, int hi):
    cython_qsort_kernel_c(a, lo, hi)
    return a

The times here:

1000 loops, best of 3: 321 µs per loop - cython
1000 loops, best of 3: 280 µs per loop - hope

Hope looks like a really nice piece of work :)

@eteq
Author

eteq commented Dec 4, 2014

@nparley - yep, that does the trick too. While the first case (the Fibonacci one) is just as fast if you simply substitute cpdef, your implementation is probably better for qsort: using cpdef there is slower because it has to wrap/unwrap numpy arrays as memoryview objects.

@eteq
Author

eteq commented Dec 4, 2014

Just to be clear, though, @cosmo-ethz, Cython's docs clearly advocate the use of cpdef where it makes sense. If I take @nparley's Fibonacci example and change cdef to cpdef, calling the C version from Python is just as fast as @nparley's version with two separate functions (possibly a tiny bit faster in some cases), and it's cleaner code.

@nparley

nparley commented Dec 5, 2014

@eteq yes, I agree that cpdef is cleaner. I guess the naming convention makes sense too (see the small sketch after this list):

- def = function called from Python
- cdef = function called from C (not visible from Python)
- cpdef = function callable from either C or Python
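
A minimal sketch of the three declaration styles side by side (the function names here are just illustrative):

def py_fib(int n):        # Python calling convention; callable from Python code
    if n < 2:
        return n
    return py_fib(n - 1) + py_fib(n - 2)

cdef int c_fib(int n):    # plain C function; not visible from Python
    if n < 2:
        return n
    return c_fib(n - 1) + c_fib(n - 2)

cpdef int cp_fib(int n):  # C function plus an auto-generated Python wrapper
    if n < 2:
        return n
    return cp_fib(n - 1) + cp_fib(n - 2)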

@cosmo-ethz
Collaborator

@nparley thanks for putting together this example of how to pass around the array buffers with Cython. The impact on performance is quite impressive. (The function signature becomes somewhat ... interesting 👍)

However, I think it's clear that with sufficient time, expertise, and tinkering it is possible to bring Cython code up to C-level performance. From a certain point on, though, the code very much resembles C code, and then it's questionable whether going directly to C wouldn't be simpler. I guess this depends on one's taste and skill in the different languages.

@eteq
Author

eteq commented Dec 5, 2014

@cosmo-ethz - I'm with you on the idea that hope is clearly the clearest (in the sense of being more Python-like). But I think expecting Cython to behave like hope in this way somewhat misrepresents what Cython is for. So I think your point is made just as well by using Cython the way it is meant to be used (optimized, and thus more complex), while making it clear that these tools all have their uses depending on your preferences and code situation.

@nparley

nparley commented Dec 6, 2014

I agree. I think the comparisons should be done with the different versions optimised. A lot of people are not going to want to learn how to use Cython, so hope and Numba are great tools here. Of course, Cython has OpenMP support, which is really useful when you have access to multi-core machines for your number crunching; a small sketch follows below.
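
A minimal sketch of that OpenMP support via cython.parallel.prange (the function name is illustrative and the gcc flags are an assumption, not from this thread; the extension has to be built with OpenMP enabled):

# e.g. in a notebook: %%cython --compile-args=-fopenmp --link-args=-fopenmp
cimport cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def parallel_sum(double[::1] a):
    cdef double total = 0.0
    cdef Py_ssize_t i
    # prange releases the GIL and splits the loop across OpenMP threads;
    # the in-place += on total is handled by Cython as a reduction
    for i in prange(a.shape[0], nogil=True):
        total += a[i]
    return total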
