Skip to content

Commit a492f26

Browse files
committed
exanic: avoid receiving corrupted frames when lapped during memcpy
Under certain timing conditions, it is possible for `exanic_receive_frame` and `exanic_receive_chunk` (and friends) to return corrupt data without reporting an error. We used the experimental setup described below to quickly (within tens of seconds) reproduce this issue. Two UDP senders were pointed to a UDP receiver, sending checksummed UDP packets at close to line rate. We found using one sender did not reproduce the issue reliably. For posterity, we used cached PIO sends from a Solarflare NIC for this step. Sender A sent a 393-byte payload full of `p` (hex `70`), while Sender B striped a 1371-byte payload with all bytes in the range [0, 255]. There's no particular signifiance to these numbers or payloads other than that they were easy to test with. On another machine, the UDP receiver process used `exanic_receive_frame` to receive packets. Upon reception, it verified the UDP checksum, and reported an error if is was invalid. (A variant using `exanic_receive_chunk` exhibited the same bug.) Between each packet, the receiver busy-looped for 30us to simulate application load. The issue does not reproduce within a reasonable timeframe if the delay is too large or too small. 30us was picked as being the ~smallest possible delay on our hardware such that the receiver lapped (i.e., `exanic_receive_frame` returned `-EXANIC_RX_FRAME_SWOVFL`) around 10 times a second. Running the experiment outlined above, the receiver quickly received a bad-checksum UDP packet. A example of one is shown below (redacted MAC and IP). 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 00000000 XX XX XX XX XX XX XX XX XX XX XX XX 08 00 45 00 |XXXXXXXXXXXX..E.| 00000010 05 6d 00 00 40 00 10 11 e8 2b XX XX XX XX XX XX |.m..@....+XXXXXX| 00000020 XX XX af af 1b 9e 05 59 7d 14 00 01 02 03 04 05 |XX.....Y}.......| 00000030 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 |................| 00000040 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 |pppppppppppppppp| 00000050 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 70 |pppppppppppppppp| 00000060 70 70 70 70 70 70 70 70 70 0d 1e 66 1b 00 00 00 |ppppppppp..f....| 00000070 9e 9f a0 a1 a2 a3 a4 a5 4e 4f 50 51 52 53 54 55 |........NOPQRSTU| ... 00000560 36 37 38 39 3a 3b 3c 3d 3e 3f 40 41 42 43 44 45 |6789:;<=>?@abcde| 00000570 46 47 48 49 4a 4b 4c 4d 4e 4f 50 e0 e3 ef 2c |FGHIJKLMNOP...,| Some narration: - The first chunk of this packet begins at `0x00` and ends at `0x77` (120). - At `0x2a` we see Sender A's payload, `00`, `01`, ..., `15`. - At `0x40`, the packet is corrupted with part of the contents of another packet from Sender B: a stream of `70`. - At `0x69` that we see a 32-bit FCS `0d 1e 66 1b` (matching the expected FCS of the Sender B's `70`-only packets). - Presumably the hardware writes at least 32-bit blocks, so we furthermore see padding of `00 00 00` to round up to `0x70`. - Between `0x70` and `0x77` we see whatever was left over in the chunk that got overwritten by sender B's final chunk. - From `0x78` onwards (i.e, the next chunk), Sender A's packet's contents resume uncorrupted. In short: the first chunk of the packet is corrupted from `0x40` to `0x77`. Minor note: the chunk is actually corrupted *backwards*. On the system under test, glibc chooses `__memcpy_ssse3_back` as its `memcpy` implementation. This implementation copies bytes back-to-front, which is why the beginning of the chunk is fine but the end is not. Given that the relevant code has a `TODO` around the buggy region, I imagine the potential for this issue was known, but the probability of its occurrence thought to be very slim. Hopefully, the example above shows it is easy to reproduce, both in test and in production. In this commit, we check that the generation number of the preceding chunk has not changed after `memcpy` returns (as that would mean we may have gotten lapped), and return an error instead. It is necessary to check the generation of the preceding chunk as opposed to re-checking the generation of the current chunk since 128-byte DMAs are not atomic (even though they fit within a single TLP -- [1] suggests the maximum possible atomic width to be 64 bytes; on our hardware we are occasionally seeing less than this). Our understanding is that this should be safe even in the case of DMA TLP reordering on the PCIe bus, as writes will still be made visible to the host in increasing address order (so, if the contents of chunk N are in the process of being changed, chunk N-1's generation number has changed even if chunk N's hasn't yet). A similar issue in the kernel driver was addressed in 2016 as part b3257af. Note that the fix there is to re-check the generation of the current chunk after copying it; this is not sufficient to entirely eliminate the race (though it does, in practice, make it significantly less likely to reproduce). A fix for the kernel driver is not included in this commit. [1]: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/pcie-burst-transfer-paper.pdf
1 parent a14b39f commit a492f26

File tree

2 files changed

+57
-13
lines changed

2 files changed

+57
-13
lines changed

libs/exanic/fifo_rx.c

Lines changed: 55 additions & 13 deletions
Original file line numberDiff line numberDiff line change
@@ -308,12 +308,16 @@ void __exanic_rx_catchup(exanic_rx_t *rx)
308308
if (eof_found)
309309
{
310310
/* Set next_chunk to the chunk after the most recent end-of-frame */
311+
rx->sentinel_chunk = eof_chunk;
312+
rx->sentinel_chunk_generation = eof_gen;
311313
rx->next_chunk = eof_chunk + 1;
312314
rx->generation = eof_gen;
313315
}
314316
else
315317
{
316318
/* Set next_chunk to where generation number changes */
319+
rx->sentinel_chunk = break_chunk;
320+
rx->sentinel_chunk_generation = break_gen;
317321
rx->next_chunk = break_chunk + 1;
318322
rx->generation = break_gen;
319323
}
@@ -339,10 +343,11 @@ ssize_t exanic_receive_frame(exanic_rx_t *rx, char *rx_buf, size_t rx_buf_size,
339343
{
340344
size_t size = 0;
341345

342-
/* Next expected packet */
346+
/* Next expected chunk */
343347
while (1)
344348
{
345-
const char *payload = (char *)rx->buffer[rx->next_chunk].payload;
349+
uint32_t current_chunk = rx->next_chunk;
350+
const char *payload = (char *)rx->buffer[current_chunk].payload;
346351

347352
/* Advance next_chunk to next chunk */
348353
rx->next_chunk++;
@@ -360,9 +365,22 @@ ssize_t exanic_receive_frame(exanic_rx_t *rx, char *rx_buf, size_t rx_buf_size,
360365
return -EXANIC_RX_FRAME_TRUNCATED;
361366

362367
memcpy(rx_buf + size, payload, u.info.length);
363-
size += u.info.length;
364368

365-
/* TODO: Recheck that we haven't been lapped */
369+
/* Move the sentinel chunk forward. */
370+
uint32_t sentinel_chunk = rx->sentinel_chunk;
371+
uint8_t sentinel_chunk_generation = rx->sentinel_chunk_generation;
372+
rx->sentinel_chunk = current_chunk;
373+
rx->sentinel_chunk_generation = u.info.generation;
374+
375+
/* Check that we couldn't have gotten lapped during memcpy. */
376+
if (rx->buffer[sentinel_chunk].u.info.generation !=
377+
sentinel_chunk_generation)
378+
{
379+
__exanic_rx_catchup(rx);
380+
return -EXANIC_RX_FRAME_SWOVFL;
381+
}
382+
383+
size += u.info.length;
366384

367385
if (timestamp != NULL)
368386
*timestamp = u.info.timestamp;
@@ -384,13 +402,6 @@ ssize_t exanic_receive_frame(exanic_rx_t *rx, char *rx_buf, size_t rx_buf_size,
384402
do
385403
u.data = rx->buffer[rx->next_chunk].u.data;
386404
while (u.info.generation == (uint8_t)(rx->generation - 1));
387-
388-
if (u.info.generation != rx->generation)
389-
{
390-
/* Got lapped? */
391-
__exanic_rx_catchup(rx);
392-
return -EXANIC_RX_FRAME_SWOVFL;
393-
}
394405
}
395406
}
396407
}
@@ -418,6 +429,12 @@ ssize_t exanic_receive_chunk(exanic_rx_t *rx, char *rx_buf, int *more_chunks)
418429

419430
if (u.info.generation == rx->generation)
420431
{
432+
/* Move the sentinel chunk forward. */
433+
uint32_t sentinel_chunk = rx->sentinel_chunk;
434+
uint8_t sentinel_chunk_generation = rx->sentinel_chunk_generation;
435+
rx->sentinel_chunk = rx->next_chunk;
436+
rx->sentinel_chunk_generation = u.info.generation;
437+
421438
/* Data is available */
422439
const char *payload = (char *)rx->buffer[rx->next_chunk].payload;
423440

@@ -434,7 +451,12 @@ ssize_t exanic_receive_chunk(exanic_rx_t *rx, char *rx_buf, int *more_chunks)
434451
/* Last chunk */
435452
memcpy(rx_buf, payload, u.info.length);
436453

437-
/* TODO: Recheck that we haven't been lapped */
454+
/* Check that we couldn't have gotten lapped during memcpy. */
455+
if (rx->buffer[sentinel_chunk].u.info.generation != sentinel_chunk_generation)
456+
{
457+
__exanic_rx_catchup(rx);
458+
return -EXANIC_RX_FRAME_SWOVFL;
459+
}
438460

439461
if (u.info.frame_status & EXANIC_RX_FRAME_ERROR_MASK)
440462
return -(u.info.frame_status & EXANIC_RX_FRAME_ERROR_MASK);
@@ -447,6 +469,13 @@ ssize_t exanic_receive_chunk(exanic_rx_t *rx, char *rx_buf, int *more_chunks)
447469
/* More chunks to come */
448470
memcpy(rx_buf, payload, EXANIC_RX_CHUNK_PAYLOAD_SIZE);
449471

472+
/* Check that we couldn't have gotten lapped during memcpy. */
473+
if (rx->buffer[sentinel_chunk].u.info.generation != sentinel_chunk_generation)
474+
{
475+
__exanic_rx_catchup(rx);
476+
return -EXANIC_RX_FRAME_SWOVFL;
477+
}
478+
450479
*more_chunks = 1;
451480
return EXANIC_RX_CHUNK_PAYLOAD_SIZE;
452481
}
@@ -476,6 +505,12 @@ ssize_t exanic_receive_chunk_ex(exanic_rx_t *rx, char *rx_buf, int *more_chunks,
476505

477506
if (u.info.generation == rx->generation)
478507
{
508+
/* Move the sentinel chunk forward. */
509+
uint32_t sentinel_chunk = rx->sentinel_chunk;
510+
uint8_t sentinel_chunk_generation = rx->sentinel_chunk_generation;
511+
rx->sentinel_chunk = rx->next_chunk;
512+
rx->sentinel_chunk_generation = u.info.generation;
513+
479514
/* Data is available */
480515
const char *payload = (char *)rx->buffer[rx->next_chunk].payload;
481516
uint8_t length;
@@ -492,7 +527,14 @@ ssize_t exanic_receive_chunk_ex(exanic_rx_t *rx, char *rx_buf, int *more_chunks,
492527
more = (u.info.length == 0);
493528
length = more ? EXANIC_RX_CHUNK_PAYLOAD_SIZE : u.info.length;
494529
memcpy(rx_buf, payload, length);
495-
/* TODO: Recheck that we haven't been lapped */
530+
531+
/* Check that we couldn't have gotten lapped during memcpy. */
532+
if (rx->buffer[sentinel_chunk].u.info.generation != sentinel_chunk_generation)
533+
{
534+
__exanic_rx_catchup(rx);
535+
return -EXANIC_RX_FRAME_SWOVFL;
536+
}
537+
496538
*more_chunks = more;
497539
*info = u.info;
498540
return length;

libs/exanic/fifo_rx.h

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -39,6 +39,8 @@ typedef struct exanic_rx
3939
int port_number;
4040
int buffer_number;
4141
volatile struct rx_chunk *buffer;
42+
uint32_t sentinel_chunk;
43+
uint8_t sentinel_chunk_generation;
4244
uint32_t next_chunk;
4345
uint8_t generation;
4446
} exanic_rx_t;

0 commit comments

Comments
 (0)