Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

index: Fix volk_get_index #519

Closed
wants to merge 3 commits into from
Closed

Conversation

jdemel
Copy link
Contributor

@jdemel jdemel commented Sep 26, 2021

This function results in an infinite loop on Debian 11 for some impls.
This is a first step to fix it.

@mbr0wn , @marcusmueller does that fix the reported issue? If you could point me to a CI script for Debian 11 that we can integrate here, that'd be very helpful.

Fix #516

EDIT
This is a minimal Debian 11 Container to test things.

FROM debian:bullseye

RUN apt-get update \
        && apt-get install -y \
        build-essential git cmake python3-mako python3-distutils liborc-dev

GNU Radio CI container for Debian 11 where volk_profile hangs on volk_32f_8u_polarbutterflypuppet_32f.
The corresponding Dockerfile: Dockerfile.

@marcusmueller
Copy link
Member

Don't we need to handle -1 in the calling functions to avoid calling into random memory?

@michaelld
Copy link
Contributor

Don't we need to handle -1 in the calling functions to avoid calling into random memory?

+1 from me on this: looks like that's just in tmpl/volk.tmpl.c : 191

lib/volk_rank_archs.c Show resolved Hide resolved
lib/volk_rank_archs.c Outdated Show resolved Hide resolved
@jdemel
Copy link
Contributor Author

jdemel commented Oct 2, 2021

I fixed the "file local comments". Thus, we need to discuss how to deal with the -1 return value.
Why did we avoid to deal with it until now? We ended up in an infinite loop and thus terminated useful execution at this point.
Considering this, the actual bug report should be fixed now.

volk_get_index is used in 2 functions:

  1. lib/volk_rank_archs.c:volk_rank_archs: return values propagate here.
  2. tmpl/volk.tmpl.c:186, i.e. in all _manual kernel calls.

volk_rank_archs is used in "one" place.

  • tmpl/volk.tmpl.c:151-152: kernel init. Thus, the first call to any kernel.

Both functions in tmpl/volk.tmpl.c return void. We can't signal errors that way. We'd have to change the API.
We go back and forth between signed and unsigned variables in these functions. But that's a separate issue.

We could treat -1 in both functions as: use the first impl in the list and hope it is supported on your system.
I'd really like to raise a NotImplementedError but that's obviously not an option either.

I know this is an unsatisfactory but we could ignore this problem for now and add it to the list of issues to fix for our next major release. Would that be a viable option?

@jdemel
Copy link
Contributor Author

jdemel commented Oct 2, 2021

I reverted the change where ! is replaced with != 0. It caused issues with most systems except Ubuntu x86. I rebased to the latest main as well.

Copy link
Contributor

@michaelld michaelld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; this change will be an API change (returning -1), so do we want to just have it in a 3.0-beta branch & not related to 2.X at all?

@michaelld
Copy link
Contributor

I reverted the change where ! is replaced with != 0. It caused issues with most systems except Ubuntu x86. I rebased to the latest main as well.

interesting ... I wonder why? Anyway, ! is fine

@michaelld
Copy link
Contributor

This PR fixes the direct issue, which is important. It does change the API unfortunately by now returning -1 where it didn't before. I guess as an interim we could instead have it return the first entry (if there is one) & print a warning noting as such. Of course if there is no entry then we'd want to return -1 ... so, does this help? Hmmm ...

@jdemel
Copy link
Contributor Author

jdemel commented Oct 3, 2021

This PR fixes the direct issue, which is important. It does change the API unfortunately by now returning -1 where it didn't before. I guess as an interim we could instead have it return the first entry (if there is one) & print a warning noting as such. Of course if there is no entry then we'd want to return -1 ... so, does this help? Hmmm ...

It sounds like every solution somehow breaks the API. However, at the moment the broken API won't work at all. After this change, it will. If we'd do another 2.x release, I'd argue this fix should go in there.
I added a message "Volk ERROR ...". If we'd return 0;, we'd most probably cause an "illegal instruction error".

Besides, it might be time to add a Debian test. However, Debian 'bullseye' ships GCC 10. We already have a test with GCC 10. The question would be: Is this distro specific?

} else {
fprintf(stderr, "Volk ERROR: no arch found. generic impl missing!\n");
}
return idx;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're just bubbling up the -1 here in case of error, but we need to handle that. Otherwise, we'll jump into some random memory location when the consumer program actually calls the kernel. That's worse than an infinite loop.

So, this is half the fix so far.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at the very least, __init_${kern.name} needs to check for -1 and abort in that case. Failing early instead of wandering into random memory, where we become a stability or security issue, is necessary. Otherwise, we're just making bad code worse!

Copy link
Member

@marcusmueller marcusmueller Oct 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also note that __init_.. casts the int that volk_rank_archs returns into a size_t, which is unsigned, and potentially of a different size!

@marcusmueller
Copy link
Member

By the way, how do we figure out what breaks finding that generic implementation? even if this PR was perfect in every way, it would not solve the underlying issue that leads to #516 . And maybe fixing that would completely rewrite this part of code, anyway...

@jdemel
Copy link
Contributor Author

jdemel commented Oct 4, 2021

Updated the initial PR description.
I started some tests with:

FROM debian:bullseye

RUN apt-get update \
        && apt-get install -y \
        build-essential git cmake python3-mako python3-distutils liborc-dev

I couldn't find liborc-dev and python3-distutils in the GNU Radio CI Docker file.
liborc-dev is essentially a dependency and should be available anyways.
python3-distutils is only a build dependency. So it won't be installed with a binary.

Besides, I can't reproduce the issue in my minimal container with the current main. Debian 11 ships VOLK 2.4.1 which I didn't test yet.

@marcusmueller
Copy link
Member

liborc-dev is essentially a dependency and should be available anyways.

liborc-dev should only be a build-time dependency, right?

@jdemel jdemel marked this pull request as draft October 8, 2021 10:06
@jdemel
Copy link
Contributor Author

jdemel commented Oct 8, 2021

I converted this PR to a draft. At the moment I doubt it really fixes the issue it is supposed to fix. We need a better understanding of the issue here.

This function results in an infinite loop on Debian 11 for some impls.
This is a first step to fix it.

Fix gnuradio#516

Signed-off-by: Johannes Demel <[email protected]>
Instead of recursively calling `volk_get_index`, a new function is
introduced to walk through the list of impls. Then, we introduce a three
step process:
1. Search for the requested impl. Return index if found.
2. Search for the generic impl. Return index if found.
3. Return `-1` and put the burden on the caller.

Signed-off-by: Johannes Demel <[email protected]>
This fix is a bit of a guess. aarch64, MacOS, and Win CI tests all fail
after I changed the comparison from `!` to `!= 0`.

Signed-off-by: Johannes Demel <[email protected]>
@jdemel
Copy link
Contributor Author

jdemel commented Sep 15, 2023

This draft PR was introduced because of a bug that turned out to be outside of VOLK. I'm closing this PR now.

@jdemel jdemel closed this Sep 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

volk_get_index broken, stuck in infinite loop
3 participants