Action hangs while writing on DDR #887
Comments
Hi Abbas
Hi Bruno,
Hi Abbas, this seems very similar to issue #882.
Yes, I am using AD9V3.
Thanks to your code, after some analysis, for some reasons, I am continuing investigations to understand what leads to this condition, and whether there is a relationship with the reset you get. Thanks
Hi @abbasBSC
Hi Bruno,
Sorry, my fault. Please read
You are right, Bruno: when writing to DDR, some situation prevents axi_card_mem0_bready from being 1. I will force this signal to 1 always. But why should this kind of problem cause a reset?
Thanks Abbas. Did you observe this reset in simulation, or only on real hardware? On my side, I haven't been able to observe any reset in simulation other than the timeout which stops the action. From what I know of AXI, there is no retry, timeout, or reset if the slave takes more time than expected or never answers.
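To make the bready point concrete, here is a minimal cycle-level sketch of the AXI write-response (B) channel in Python. It is a toy model, not SNAP or PSL code: the function name, the `max_outstanding` limit, and the one-cycle response latency are all assumptions chosen for illustration. It only shows that a response is transferred when bvalid and bready are both high, and that nothing in the protocol itself rescues a master that stops asserting bready.

```python
def run_write_bursts(num_bursts, bready_pattern, max_outstanding=2, max_cycles=100):
    """Toy model of the AXI B channel.

    bready_pattern(cycle) -> bool gives the master's bready for that cycle.
    max_outstanding is an assumed limit on responses the slave will queue.
    Returns (completed_bursts, cycles_used, hung).
    """
    issued = 0      # bursts whose address/data phases were accepted
    pending = 0     # responses waiting in the slave (bvalid is high when > 0)
    completed = 0   # responses actually transferred to the master
    for cycle in range(max_cycles):
        bready = bready_pattern(cycle)
        bvalid = pending > 0
        # A response moves only when both sides agree (bvalid && bready).
        if bvalid and bready:
            pending -= 1
            completed += 1
        if completed == num_bursts:
            return completed, cycle + 1, False
        # The slave accepts a new burst only while it can queue its response.
        if issued < num_bursts and pending < max_outstanding:
            issued += 1
            pending += 1
    # No retry or timeout exists in the handshake: we simply ran out of cycles.
    return completed, max_cycles, True
```

With `bready_pattern=lambda c: True` all bursts complete; with a pattern that drops bready and never raises it again, the model stalls until `max_cycles`, which mirrors the hang that only the action timeout ends.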
I have seen the reset in simulation, but not with this code, only with other code.
In my code, when writing to DDR stops (because of the bready signal), reading from host DDR also stops, because the FPGA FIFO fills up and no longer accepts new data. Hence the AXI bus on the CAPI side hangs. Based on my observations, the reset problem occurs when the AXI bus on the CAPI side hangs and never finishes: on timeout, the attempt to detach the action causes the reset. It might not be the only case that causes a PCIe reset, but it is certainly one of them, as I tested with different code both in simulation and on the real system.
To solve the PCIe reset problem, I tried resetting the action code and the axi_read/write modules by setting a bit in the ACTION_CONTROL registers after each action timeout or action idle (before detach) and using it as a reset for my FPGA modules. It solved the problem in some cases but not always. When AXI hangs, the axi_read/write modules enter a state and never leave it. Resetting them on timeout returns them to their initial/idle state before action detach, so the next run also starts from the initial/idle state.
But this is only one side of the AXI bus. I am not able to reset the other side, which is the PSL part. If the PSL is trapped in a state and nothing resets it after the timeout, the next run continues from that previous, unknown state. So it is not only about the AXI bus (on the CAPI side) hanging: the same can happen if your action times out while it is sending or receiving data through AXI. In that case even your next runs are not valid, because the PSL state machine is probably still in the undesired state where it stopped in the previous run. That's why you might sometimes face the reset problem even if your AXI logic is working correctly.
In a nutshell, if the AXI bus on the CAPI side does not work properly, a PCIe reset is possible. In my opinion, resetting all FPGA modules after each action idle or timeout should solve the problem.
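The back-pressure chain described in this comment (stalled DDR write, full FIFO, stalled host read) can be sketched with a small Python model. This is only an illustration under simplifying assumptions: one word per cycle on each side, and a hypothetical `ddr_stall_at` parameter that stands in for the point where the write side stops making progress. None of these names come from the SNAP code base.

```python
from collections import deque

def simulate_copy(words, fifo_depth, ddr_stall_at=None, max_cycles=1000):
    """Toy model of host read -> FIFO -> DDR write.

    ddr_stall_at: number of words after which the DDR write side stops
    accepting data forever (None means it never stalls).
    Returns a dict with the words read, the words written, and a hung flag.
    """
    fifo = deque()
    read_count = 0
    written = 0
    for _ in range(max_cycles):
        ddr_ready = ddr_stall_at is None or written < ddr_stall_at
        # Write side: drain one word per cycle while DDR accepts data.
        if fifo and ddr_ready:
            fifo.popleft()
            written += 1
        # Read side: fetch one word per cycle while the FIFO has room.
        if read_count < words and len(fifo) < fifo_depth:
            fifo.append(read_count)
            read_count += 1
        if written == words:
            return {"read": read_count, "written": written, "hung": False}
    return {"read": read_count, "written": written, "hung": True}
```

In the stalled case the read side keeps going only until the FIFO is full, so the number of words read ends up at `ddr_stall_at + fifo_depth` and then both sides sit idle until the timeout.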
Try running "capi-reset 0 user" when you time out. This will reload the FPGA image and also put the PSL into an initial state. But I still want to understand what happened on the DDR interface. Why doesn't axi_card_mem0_bready go high?
You get "capi-reset" after you clone https://github.com/ibm-capi/capi-utils and
While writing to DDR, in some situations my code was not able to set bready again. It was a mistake in my code. This situation never happens when writing to host memory.
Yes, it is a manual reset. It is not nice to reset the FPGA card manually after each timeout. Besides, I think the reset should happen before action detach, somewhere in the SNAP code, to avoid the PCIe reset.
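Until a reset before detach exists in SNAP itself, the manual step can at least be scripted on the host. The sketch below is a workaround, not a fix: it runs an application with a wall-clock timeout and, if the timeout fires, runs the "capi-reset 0 user" command quoted above. The function name and the application command are hypothetical, capi-reset is assumed to be on the PATH, and it may need elevated privileges on your system.

```python
import subprocess

def run_with_card_reset(app_cmd, timeout_s, reset_cmd=("capi-reset", "0", "user")):
    """Run app_cmd; if it exceeds timeout_s, kill it and run reset_cmd.

    Returns "ok" (exit code 0), "failed" (non-zero exit code),
    or "reset" (the application timed out and the reset command was run).
    """
    try:
        # subprocess.run kills the child before raising TimeoutExpired.
        result = subprocess.run(app_cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Reload the FPGA image so the next run starts from a known state.
        subprocess.run(list(reset_cmd), check=True)
        return "reset"
    return "ok" if result.returncode == 0 else "failed"

# Hypothetical usage (the application name is an assumption):
# status = run_with_card_reset(["./my_snap_app"], timeout_s=30)
```

Note that this does not address the request in the comment above: the reset still happens after the application has been killed, not before the action is detached.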
Hi @abbasBSC.
Hi,
snap_maint is only there to "discover" the action(s) present on a card, so it needs to be executed only once. When you call it a second time, it doesn't even try to re-access the card.
That makes sense. But if one action expires, it affects the functionality of other actions, as well as the expired action itself on its next run, since the CAPI modules are no longer in the idle state.
Understood and agreed. This was my main difficulty when building actions/hls_latency: making sure that the action would time out in every case and come back to its initial state.
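The idea of an action that always finds its way back to the initial state can be sketched as a watchdog on the action's state machine. The Python model below is a behavioral illustration only and is not how actions/hls_latency is implemented; the class name, the state names, and the `watchdog_limit` parameter are assumptions.

```python
class ActionModel:
    """Toy action FSM with a watchdog that forces a return to IDLE."""

    def __init__(self, work_items, watchdog_limit):
        self.work_items = work_items          # transfers needed per run
        self.watchdog_limit = watchdog_limit  # stalled cycles tolerated
        self.state = "IDLE"
        self.remaining = 0
        self.stalled_cycles = 0
        self.timed_out = False

    def start(self):
        # Every run begins from a fully re-initialised state.
        self.state = "RUNNING"
        self.remaining = self.work_items
        self.stalled_cycles = 0
        self.timed_out = False

    def tick(self, bus_ready):
        """Advance one cycle; bus_ready says whether a transfer can proceed."""
        if self.state != "RUNNING":
            return
        if bus_ready:
            self.remaining -= 1
            self.stalled_cycles = 0
            if self.remaining == 0:
                self.state = "IDLE"
        else:
            self.stalled_cycles += 1
            if self.stalled_cycles >= self.watchdog_limit:
                # Give up on the stalled transfer and go back to IDLE.
                self.timed_out = True
                self.state = "IDLE"
```

As pointed out earlier in the thread, a watchdog like this only covers the action's side of the bus; it says nothing about the state the PSL is left in.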
Hi all,
I have written HDL code that works somewhat like hls_memcpy: it reads and/or writes data from/to host memory or the FPGA card DDR.
I ran into a problem when writing to DDR. When I run the application for the first time, it writes to DDR and the action finishes, but on the second run the action hangs, and sometimes I need to run snap_maint again (in simulation it works properly). The main problem is that the last time, snap_maint returned this error:
Error: Can not open CAPI-SNAP Device: /dev/cxl/afu0.0m
I ran snap_maint many times and still get the same error. Unfortunately, the FPGA card is no longer detected! Should I flash the FPGA with the factory bitstream?
Why would this happen?