Fix check for retransmission of discarded block segments #546

samsamfire · 2024-11-06T21:23:38Z

Hi,

I noticed this issue whilst working on another canopen package.
The block upload retransmit does not work correctly.
In the event that a client does not properly receive a sub-block, it sends an end sub-block message with the last acknowledged segment number.
All the frames between ackseq and blksize sent by the server should be ignored (this is currently the case).
However, the server will start resending the missed frames (between ackseq and blksize) at the beginning of the new block, so at seqno==1.
This is difficult to test within the library as there is no SDO server supporting block transfer, but I have tested it against another implementation and it works OK.

codecov-commenter · 2024-11-06T21:24:42Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.84%. Comparing base (ffbd10f) to head (ab4d150).
Report is 1 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #546      +/-   ##
==========================================
+ Coverage   71.36%   71.84%   +0.47%     
==========================================
  Files          26       26              
  Lines        3129     3129              
  Branches      480      480              
==========================================
+ Hits         2233     2248      +15     
+ Misses        765      752      -13     
+ Partials      131      129       -2

Files with missing lines	Coverage Δ
canopen/sdo/client.py	`75.27% <100.00%> (+3.25%)`	⬆️

samsamfire · 2024-11-26T09:12:42Z

Hello,

What are your thoughts ?

acolomb · 2024-11-26T10:36:23Z

Sorry, didn't find the time to look at it yet. Will do soon when possible.

acolomb · 2025-01-13T14:47:38Z

I've tried to wrap my head around this part of the standard, but I really cannot judge from just reading it whether this is more correct. Sorry, I have very limited hands-on experience with SDO block transfers, so not easy to see what's going on. So I'm hesitant about merging the change without further understanding what it actually fixes.

Could you maybe try to record a bus log which triggers this condition, from a correctly behaving client? Then we could add that as an expected message exchange in the test_sdo.py file and validate in a test case. There are lots of examples there which do validate the generated CAN objects, so this should be well testable.

samsamfire · 2025-01-13T21:38:41Z

Hi,

Completely agree that a test should be added. The problem is that we currently don't have an SDO server supporting block transfer. Also, I don't think we could see this just from the CAN frames, because the protocol part is correct. What's happening is that some frames are getting ignored on the client side, which results in a wrong CRC at the end of the transfer.

Let me try to re-phrase what the current problem is and add an example.
When doing an SDO block upload, the server sends blocks of data, each with a predefined size. For this example I'll take 127, which is the maximum block size.
The block is composed of frames which all start with a sequence number going from 1 to 127. The client expects 127 frames to arrive in order to validate the block, and expects them to arrive in the correct order 1-127.
If for some reason a block segment is lost or arrives in an incorrect order, the client will acknowledge the last "good" segment.
So if a client sees segments arriving in the order 1, 2, 3, 4, 6, 5, 7, 8, it will reply as soon as it detects the problem, acknowledging the last good segment, which is 4.
This means that the SDO server should retransmit all the un-acknowledged segments starting with segment 5.
Failure to do so will result in an incorrect CRC at the end of the block transfer.
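
To make the acknowledgement rule above concrete, here is a minimal sketch. The helper name `last_good_seqno` is hypothetical and not part of the library; it only assumes that segments in a block are numbered from 1 and must arrive strictly in order.

```python
def last_good_seqno(received, blksize=127):
    """Return the highest sequence number that arrived in order, counting from 1.

    Hypothetical helper for illustration only, not part of the canopen package.
    """
    expected = 1
    for seqno in received:
        if seqno != expected:
            break  # gap or reordering detected: everything from here on is discarded
        expected += 1
        if expected > blksize:
            break  # the whole block has been received
    return expected - 1


ackseq = last_good_seqno([1, 2, 3, 4, 6, 5, 7, 8])
print(ackseq)  # 4
```

In this example the client would acknowledge 4, and the data originally carried by segments 5 onwards is what the server has to send again.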

The current implementation of retransmit looks like this :

    def _retransmit(self):
        logger.info("Only %d sequences were received. Requesting retransmission",
                    self._ackseq)
        end_time = time.time() + self.sdo_client.RESPONSE_TIMEOUT
        self._ack_block()
        while time.time() < end_time:
            response = self.sdo_client.read_response()
            res_command, = struct.unpack_from("B", response)
            seqno = res_command & 0x7F
            if seqno == self._ackseq + 1:
                # We should be back in sync
                self._ackseq = seqno
                return response
        self._error = True
        self.sdo_client.abort(0x05040000)
        raise SdoCommunicationError("Some data were lost and could not be retransmitted")

We are waiting for a frame whose sequence number is one past the last known-good sequence number (`self._ackseq + 1`) before we start considering the messages again. However, this is wrong, because the SDO server will start sending the discarded segments at the start of a new block, i.e. beginning at sequence number 1.
Simplified example of what is happening :

SERVER

[TX] 1...
[TX] 2...
[TX] 3...
[TX] 4...
[TX] 6... ==> Wrong seqno received (can be client or server's fault)
[TX] 5...
... Can continue sending rest of block

CLIENT

[TX] 4... ==> Last good segment is 4

SERVER

[TX] 1... ==> This corresponds to data of seqno "5" of previous block
[TX] 2...
[TX] 3...
[TX] 4... ==> This is where the current implementation considers to be back in sync, which is wrong.
...
[TX] 127...

CLIENT

[TX] 127 ==> Complete block received successfully

I hope this makes things clearer.
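
The exchange above can also be replayed as a small simulation. This is only a sketch under stated assumptions: `receive_after_retransmit` is a hypothetical helper, the payload is made up, and `binascii.crc_hqx` (CRC-CCITT, polynomial 0x1021) with an initial value of 0 merely stands in for the CRC checked at the end of the transfer.

```python
import binascii

SEGMENT_SIZE = 7  # data bytes carried by one block segment


def receive_after_retransmit(new_block, resync_seqno):
    """Drop frames until one carries resync_seqno, then keep every frame.

    Hypothetical model of the retransmit loop, for illustration only.
    """
    kept = []
    in_sync = False
    for seqno, data in new_block:
        if not in_sync and seqno == resync_seqno:
            in_sync = True
        if in_sync:
            kept.append(data)
    return kept


# Ten made-up segments; the client accepted the first four before the error.
payload = [bytes([i]) * SEGMENT_SIZE for i in range(1, 11)]
ackseq = 4
already_received = payload[:ackseq]

# The server restarts numbering at 1 for the unacknowledged data.
new_block = list(enumerate(payload[ackseq:], start=1))

expected = b"".join(payload)
old = b"".join(already_received + receive_after_retransmit(new_block, ackseq + 1))
new = b"".join(already_received + receive_after_retransmit(new_block, 1))

print(new == expected)  # True: nothing is lost when resyncing at seqno 1
print(old == expected)  # False: four retransmitted segments were dropped
print(binascii.crc_hqx(new, 0) == binascii.crc_hqx(expected, 0))  # True
```

With the old check the client silently skips the first four retransmitted segments, so the reassembled data is shorter than what the server sent and its CRC will almost certainly not match.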

samsamfire · 2025-01-15T10:56:23Z

Hello,

I've added a test for SDO block transfer retransmit, this took me a bit of time.
This test passes with the fix in this PR, but fails with an invalid CRC against the current implementation, because the client ignores some segments when it shouldn't, as discussed previously.

samsamfire · 2025-02-25T09:57:37Z

Hello,

It would be great to have some feedback.

friederschueler · 2025-02-25T14:17:45Z

Hi Samuel, I will try to have a look at your commit this week if @acolomb has no time for it.

acolomb

Sorry for the long wait. I haven't had as much spare time as I had hoped for this project, and there is a bit of backlog.

I've re-read the protocol description and I think I now understand better what this fix is doing. It looks correct from my side, but again, I haven't been able to test it.

As for the unit test, one small issue was unclear even with the added comment. And could the test data be shortened? Can we somehow force a lower blksize parameter in the client for this test execution, so fewer frames are required?

I'm also wondering whether there is a chance that the client might send the response (acknowledging the last good sequence number before failure) earlier, without waiting for the rest of the block being uploaded from the server. This will happen with our client implementation, right? So what does the unit test do with the extra RX frames?

test/test_sdo.py

acolomb · 2025-02-25T21:37:07Z

canopen/sdo/client.py

-            if seqno == self._ackseq + 1:
+            if seqno == 1:


I'm wondering why the self._ack_block() call doesn't simply reset the self._ackseq attribute to zero in all cases. Then this check would be fine as is?

Just thinking out loud, let's hear your thoughts on why this is the better place to fix it.
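
A rough sketch of that alternative, using a toy class rather than the real client code: `BlockReceiver`, `ack_block` and `accepts` are hypothetical names chosen for illustration, and the only assumption is that acknowledging a block also resets the counter to zero.

```python
class BlockReceiver:
    """Toy stand-in for the client-side block upload state (not the library class)."""

    def __init__(self):
        self.ackseq = 0
        self.acked = []  # sequence numbers reported back to the server

    def ack_block(self):
        # Report the last good segment, then start counting the next block from zero.
        self.acked.append(self.ackseq)
        self.ackseq = 0

    def accepts(self, seqno):
        # Unchanged check: the next segment must directly follow the last accepted one.
        if seqno == self.ackseq + 1:
            self.ackseq = seqno
            return True
        return False


rx = BlockReceiver()
for seqno in (1, 2, 3, 4):
    rx.accepts(seqno)
rx.accepts(6)          # out of order, rejected
rx.ack_block()         # acknowledges 4 and resets the counter
print(rx.accepts(1))   # True: the retransmitted data is picked up at seqno 1
```

With the reset in place, `seqno == self.ackseq + 1` and `seqno == 1` are the same condition right after an acknowledgement, so the existing check could stay as it is.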

@samsamfire Are you still onto this?

samsamfire · 2025-02-26T12:40:44Z

Hi, no problem.

For the test data, this is purely because of my setup and the client that I have (the data is taken from a real transaction). It is a bit "long" in lines, but it's not long to execute and I think it's pretty representative of the behavior of CANopen nodes (127 sized blocks are standard). We could probably create the frames programmatically but it would ruin readability.
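
For reference, generating such frames programmatically could look roughly like the sketch below. `build_segments` is a hypothetical helper, not something from the test suite; it assumes the usual block segment layout, where byte 0 carries the sequence number in bits 0..6 and the "last segment" flag in bit 7, followed by up to 7 data bytes padded with zeros.

```python
def build_segments(data, blksize=127):
    """Split data into 8-byte block segment frames (hypothetical helper)."""
    chunks = [data[i:i + 7] for i in range(0, len(data), 7)]
    frames = []
    for index, chunk in enumerate(chunks):
        seqno = index % blksize + 1          # numbering restarts at 1 in every block
        last = index == len(chunks) - 1      # flag only the final segment of the transfer
        command = (0x80 if last else 0x00) | seqno
        frames.append(bytes([command]) + chunk.ljust(7, b"\x00"))
    return frames


frames = build_segments(b"0123456789ABCDEF", blksize=2)
for frame in frames:
    print(frame.hex())
```

This only covers the data segments; the initiate, acknowledge and end frames of the exchange would still have to be written out by hand.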

For the "extra" frames, the test was to make sure they were indeed "properly" ignored. The server behavior can differ here, as it might wait for the whole block to be transmitted before dealing with the acknowledge block from the client.

acolomb · 2025-04-27T21:26:43Z

I wouldn't want the test data to be programmatically generated while running the test. However, a smaller amount would still be preferred. If you can produce that, fine. If it's impossible on your hardware, please see if you can emulate it somehow to get this data more focused on the actual errors being tested. And last resort, we can use the data as is, but I'd really like to see if it can be shortened easily first.

Please also check out the review conversations @samsamfire, some questions there are still unanswered.

acolomb · 2025-06-09T18:57:34Z

Ping @samsamfire, I would like to move forward with this fix, but waiting on your replies. Meanwhile, I applied some small cleanup commits to make this fit in better with the surrounding code.

And I haven't been able to reproduce the data frames in any sensible way. Could you please try to redo them with shorter data (and thus lower blksize)?

codecov · 2025-06-10T21:06:05Z

Codecov Report

All modified and coverable lines are covered by tests ✅

samsamfire · 2025-06-10T21:47:00Z

Hello @acolomb ,
I've made the slight changes concerning comment and the use of ack_block function to directly reset counter.
Concerning the data, I'm afraid I don't have the time to redo a different test easily, capture the data correctly etc.
I still think it makes for a good test. If the size is such a big problem, one option would be putting it in a separate "raw binary file".
Also, on the test docstring I would keep what the test does. Removing it just removes some useful information.

acolomb

Thanks for keeping up with the review change requests!

I believe this is now correct, and all the tests still pass even with the different fix approach. Thus let's merge and see what this does out in the wild.

test/test_sdo.py

Fixed : retransmission of discarded segments starts at beginning of n…

1a74fb6

…ew block

samsamfire force-pushed the sdo-block-upload-retransmit-fix branch from 4725972 to 5341142 Compare January 15, 2025 10:41

Test : add a block retransmit test. This test passes with this implem…

33aa620

…entation but will fail with an invalid CRC without fix for discarded segments.

samsamfire force-pushed the sdo-block-upload-retransmit-fix branch from 5341142 to 33aa620 Compare January 15, 2025 10:52

samsamfire added 2 commits January 15, 2025 11:57

Comment : fixed wrong value in comment

7fe2c71

Comment : better comments on the different steps of the block

ab4d150

acolomb reviewed Feb 25, 2025

acolomb added 3 commits June 9, 2025 20:50

Fix docstring.

2692989

Fix quoting style to match surrounding code.

43e9377

Use test framework assertion instead of plain assert.

ff12cb4

Just like the other test cases. Remove unnecessary initialization of local variable.

Merge branch 'canopen-python:master' into sdo-block-upload-retransmit…

7822ca4

…-fix

samsamfire added 3 commits June 10, 2025 23:19

Fixed : moved comments up (wrong line)

df4af56

Changed : replace double x34 with x34/x33 to make it clearer that the…

37c6e4b

…re was an an inversion, even if it doens't make any difference

Changed : reset _ackseq every block

a0a40b8

acolomb approved these changes Jun 11, 2025

Update test/test_sdo.py

93ed50b

acolomb changed the title ~~Fixed : retransmission of discarded segments starts at beginning of new block~~ Fix check for retransmission of discarded block segments Jun 11, 2025

acolomb merged commit 7ddb19b into canopen-python:master Jun 11, 2025
4 of 5 checks passed
