
[RHEL-71813] Combine vnet header and data #1352


Merged
8 commits merged into virtio-win:master from RHEL-71813 on May 10, 2025

Conversation

ybendito (Collaborator)

No description provided.

@JonKohler (Contributor)

@sb-ntnx this is pretty cool

@ybendito (Collaborator, Author)

@sb-ntnx this is pretty cool

Note that this does not change things for virtual devices, only for bare metal.

@ybendito (Collaborator, Author) commented Apr 26, 2025

One failure across all runs: SingleEtherType on Win10_22H2, an unclear assertion and BSOD; this is the same ndtest timeout on the bind operation that we saw twice before.
On rerun everything passed as expected.

@YanVugenfirer (Collaborator)

@zjmletang you are welcome to review

@YanVugenfirer requested a review from zjmletang on April 27, 2025 12:49
@ybendito (Collaborator, Author)

@zjmletang you are welcome to review

@zjmletang you could probably even try it and ack that this PR does the job and improves the throughput.

@zjmletang (Member)

@YanVugenfirer @ybendito Thank you both. I feel honored to have this learning opportunity. I carefully read through this PR, and I found a point that may not align with my thoughts, which I would like to discuss.

My initial intention for #1089 was to reduce the number of descriptors for RX packets by merging the virtio header and data. Under the right conditions (specifically, the ability to allocate about 64K of contiguous memory, which is feasible in most scenarios), one RX packet should ideally be represented by a single descriptor. This would significantly enhance DMA performance on the host side. The patch I submitted previously aimed to describe an RX packet using only one descriptor (at that time, the new RX layout mechanism had not yet been introduced). However, under the current RX layout mechanism (with RX_LAYOUT_AS_BEFORE undefined by default), although we have merged the virtio header and data into one descriptor, the tail portion still occupies an additional descriptor. As a result, we are unable to achieve the goal of representing one RX packet with a single descriptor. In my last comment on #1089, I expressed concerns that our latest design might conflict with the goal of reducing descriptors (specifically due to the extra descriptor used for the tail).
Thank you for considering my thoughts, and I look forward to your insights!

@ybendito (Collaborator, Author) commented Apr 28, 2025

@zjmletang I understand your point. Indeed, with the current master the device has to:

  1. Download entry descriptor
  2. Download descriptors for header and payload
  3. Push header to RAM
  4. Push payload to RAM
  5. Push total size to entry descriptor
  6. Push entire entry to queue

After my changes:

  1. Download entry descriptor
  2. Download descriptors for payload+header and tail
  3. Push payload+header to RAM
  4. Push total size to entry descriptor
  5. Push entire entry to queue

Your goal is:

  1. Download entry descriptor
  2. Push payload+header to RAM
  3. Push total size to entry descriptor
  4. Push entire entry to queue

Is that correct?
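
For illustration, a minimal C sketch of the three scatter-gather layouts compared above (all structure names, addresses, and sizes below are made up; this is not the NetKVM code):

```c
#include <stdint.h>

/* Illustrative SG entry and numbers; not the NetKVM structures. */
typedef struct { uint64_t phys; uint32_t len; } sg_entry;

#define VNET_HDR 12u      /* assumed vnet header size        */
#define DATA_LEN 65536u   /* assumed contiguous RX data area */
#define TAIL_LEN 240u     /* assumed tail size, arbitrary    */

/* Current master: header, data and tail are separate entries (3 fragments). */
static const sg_entry rx_master[] = {
    { 0x10000u, VNET_HDR },
    { 0x20000u, DATA_LEN },
    { 0x10200u, TAIL_LEN },
};

/* This PR: header merged into the first data page; the tail stays in the
   page that also holds the indirect area (2 fragments). */
static const sg_entry rx_combined[] = {
    { 0x20000u, VNET_HDR + DATA_LEN },
    { 0x10200u, TAIL_LEN },
};

/* The stated goal (one large contiguous allocation): a single direct
   descriptor covering header, data and tail (1 fragment). */
static const sg_entry rx_single[] = {
    { 0x20000u, VNET_HDR + DATA_LEN + TAIL_LEN },
};
```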

@zjmletang (Member)

@ybendito In fact, the efficiency of DMA on the host side can generally be influenced by the following two factors (please note that this may apply to multiple hardware platforms and is not specific to Alibaba Cloud):

  1. The number of descriptors: fewer is better (as mentioned earlier).
  2. Whether the descriptors are direct or indirect: direct descriptors are preferable to indirect ones, as indirect descriptors require an additional layer of parsing on the host side.

With this in mind, I would like to discuss a potential modification (just a discussion, since the idea may not be fully developed yet): when we can represent an RX packet with only one descriptor, could we consider changing it to a direct descriptor? This approach does not violate the virtio protocol, and it seems that Linux implements it this way (as shown in the image below). Such a change could not only save memory (by eliminating the need for the indirect memory portion) but also further enhance DMA efficiency on the host side.
[image: Linux implementation referenced above]

@ybendito (Collaborator, Author) commented Apr 28, 2025

@zjmletang For a single descriptor this already works that way, if I'm not mistaken, i.e. for a single descriptor the indirect is not used:
https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/VirtIO/VirtIORing.c#L239
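
Roughly, the check there amounts to the following (a paraphrased sketch, not the exact VirtIORing.c code):

```c
#include <stdbool.h>
#include <stddef.h>

/* Paraphrased sketch of the split-ring "add buffer" decision: the indirect
   table is used only when an indirect area was supplied AND more than one
   buffer entry has to be added; a single-entry request goes into a direct
   descriptor, so the pre-allocated indirect area is simply not touched. */
static bool use_indirect_descriptors(const void *va_indirect,
                                     unsigned int out, unsigned int in)
{
    return va_indirect != NULL && (out + in) > 1;
}
```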

@zjmletang (Member)

@zjmletang For a single descriptor this already works that way, if I'm not mistaken, i.e. for a single descriptor the indirect is not used: https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/VirtIO/VirtIORing.c#L239

@ybendito You're right; thank you for the reminder. Given this, does that mean we don't need to pre-allocate the indirect buffer for this type of single descriptor situation? In other words, we wouldn't need to create a separate header page to store the indirect buffer and tail, and we could just have the virtio header and data (including the tail) placed directly in a single descriptor.

@zjmletang (Member)

@zjmletang I understand your point. Indeed, with the current master the device has to:

  1. Download entry descriptor
  2. Download descriptors for header and payload
  3. Push header to RAM
  4. Push payload to RAM
  5. Push total size to entry descriptor
  6. Push entire entry to queue

After my changes:

  1. Download entry descriptor
  2. Download descriptors for payload+header and tail
  3. Push payload+header to RAM
  4. Push total size to entry descriptor
  5. Push entire entry to queue

Your goal is:

  1. Download entry descriptor
  2. Push payload+header to RAM
  3. Push total size to entry descriptor
  4. Push entire entry to queue

Is that correct?

@ybendito Exactly, my main goal is to have as few descriptors as possible included in the entry descriptor, ideally just one.

@ybendito (Collaborator, Author)

@zjmletang For a single descriptor this already works that way, if I'm not mistaken, i.e. for a single descriptor the indirect is not used: https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/VirtIO/VirtIORing.c#L239

@ybendito You're right; thank you for the reminder. Given this, does that mean we don't need to pre-allocate the indirect buffer for this type of single descriptor situation? In other words, we wouldn't need to create a separate header page to store the indirect buffer and tail, and we could just have the virtio header and data (including the tail) placed directly in a single descriptor.

We can never know for sure that we'll be able to allocate a large buffer (16 pages and up). For that case we need the indirect area. Currently (with this PR) we allocate only 512 bytes for the indirect area + tail, not a full page.
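
As a rough back-of-the-envelope check of that 512-byte block (the split below is an assumption for illustration, not taken from the code):

```c
/* Assumed split of the 512-byte block, for illustration only. */
enum {
    VRING_DESC_SIZE    = 16,                                   /* addr+len+flags+next */
    MAX_INDIRECT_DESCS = 17,                                   /* ~64K of 4K pages    */
    INDIRECT_AREA      = VRING_DESC_SIZE * MAX_INDIRECT_DESCS, /* 272 bytes           */
    SHARED_BLOCK       = 512,
    TAIL_SPACE         = SHARED_BLOCK - INDIRECT_AREA          /* 240 bytes for tail  */
};
```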

ybendito added 5 commits May 3, 2025 00:35
Fixes: 8860475 (netkvm: use optimized layout of shared memories for RX)

On DMA memory allocation failure the driver does not change
the amount of allocated memory. So if the large allocation
fails, the RX descriptor will not be created at all.

Signed-off-by: Yuri Benditovich <[email protected]>
Add new fields for flexibility:
- page number where we have a virtio header (currently 0)
- offset of data start in page 1 (currently 0)
- index of page number (currently there is no difference
  between the page index and the index in the SG array)

Currently these fields are not in use. In the future, when
the header and data are combined:
- the header page will be 1
- the data start offset will be the size of the vnet header
- the first page will not be used by the SG; it will contain
  only the indirect area and the tail of the data.

Signed-off-by: Yuri Benditovich <[email protected]>
First, this is not needed. The exact size of the payload
is defined in the NB and the MDL chain may be longer.
Second, this complicates the configuration of the first
MDL when we need to combine the vnet header and data.
If we change the length of the MDL we need to restore it upon
deletion, so we would need to save the initial length (it might
be different from the size of the respective page if the data
and header are combined).

Signed-off-by: Yuri Benditovich <[email protected]>
Currently nothing is changed and the RX initialization works
exactly as before, but the code is ready to support the case
where we do not reserve any separate space for the vnet header
and instead merge it with the packet data.

Signed-off-by: Yuri Benditovich <[email protected]>
Jira link: https://issues.redhat.com/browse/RHEL-71813

If the device is V1 or offers VIRTIO_F_ANY_LAYOUT, it does
not make any assumptions regarding the framing of packets and
does not require a separate descriptor for the vnet
header. On a bare metal device this reduces the number of DMA
transactions in the RX path.
In order to get back to the previous behavior it is enough to
revert this commit or just set combineHeaderAndData to
false always.

Signed-off-by: Yuri Benditovich <[email protected]>
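
A minimal sketch of the gating this commit message describes (the function and variable names here are illustrative, not the commit's code; the commit's flag is combineHeaderAndData, and the feature bit values are taken from the virtio specification):

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers per the virtio specification. */
#define VIRTIO_F_ANY_LAYOUT 27
#define VIRTIO_F_VERSION_1  32

/* With either feature the device makes no framing assumptions, so the vnet
   header does not need a descriptor of its own. */
static bool can_combine_header_and_data(uint64_t host_features)
{
    return (host_features & (1ULL << VIRTIO_F_VERSION_1)) != 0 ||
           (host_features & (1ULL << VIRTIO_F_ANY_LAYOUT)) != 0;
}
```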
@ybendito force-pushed the RHEL-71813 branch 5 times, most recently from 6615b95 to afcae2b on May 3, 2025 16:25
Should be defined as 0 or 1; for RHEL we define 1.
This will be used in the configuration of RX memory allocation.
Value=0 will probably be better for a bare-metal implementation.

Signed-off-by: Yuri Benditovich <[email protected]>
Jira link: https://issues.redhat.com/browse/RHEL-71813
Fixes: virtio-win#1078

If "Init.SeparateRxTail" is set to 0 the driver will try to
allocate contiguous memory block for each RX buffer. If it
succeeds, the virtio descriptor will contain single entry
and the device will not spend time retrieving the indirect
chain. For the bare metail device this might be better.
We leave the default value "1" and by default the driver
will pace the tail in separate memory block (where we
keep indirect descriptors).

Signed-off-by: Yuri Benditovich <[email protected]>
This adds a possibility to simulate a failure to allocate
large blocks. If defined, only blocks of one page and smaller
will be returned.

Signed-off-by: Yuri Benditovich <[email protected]>
@ybendito (Collaborator, Author) commented May 4, 2025

@zjmletang Please try this one with configuration "Init.SeparateRxTail"=Disable

@zjmletang (Member)

@zjmletang Please try this one with configuration "Init.SeparateRxTail"=Disable

@ybendito great! My leave ends tomorrow, and I will test it as soon as possible and provide feedback to you.

@ybendito (Collaborator, Author) commented May 6, 2025

@zjmletang If you do your own build, note the commit with INX_NETKVM_SEPARATE_TAIL; place it in your vendor file, and you'll probably want it = 0.

@ybendito (Collaborator, Author) commented May 7, 2025

Tests:
Win11: unclear failure of DF - Reinstall with IO Before and After, passed on rerun
2022: unclear failure of PNP DIF Remove Device, passed on rerun

@zjmletang (Member)

@zjmletang Please try this one with configuration "Init.SeparateRxTail"=Disable


@ybendito
The following are our test results: This patch significantly improves throughput on the host side. We validated it on two internal hypervisors, both of which implement the virtio net backend in hardware.

[image: throughput test results]

Test Results Analysis:
From the host's theoretical perspective, the mechanism optimized by the patch (Init.SeparateRxTail = Enable) reduces the number of fragments from 3 to 2 compared to the master branch. Most data resides in the first fragment, with the second fragment rarely used, thereby reducing DMA requests. When Init.SeparateRxTail = Disable, the primary benefit lies in switching from indirect to direct mode, eliminating the need to parse entry descriptors, which further enhances throughput. The above information is for reference.

@ybendito (Collaborator, Author) commented May 8, 2025

@zjmletang Thank you for sharing the data. If possible, please briefly describe the testing conditions:

  • typical packet size
  • kind of transfer (tcp/udp/raw/-nagle etc)

@zjmletang (Member) commented May 8, 2025

@zjmletang Thank you for sharing the data. If possible, please briefly describe the testing conditions:

  • typical packet size
  • kind of transfer (tcp/udp/raw/-nagle etc)

@ybendito
packet size: 1400
kind of transfer: udp

@zjmletang (Member)

@ybendito I have a few additional questions and discussion points:

  1. Is it necessary to expose the Init.SeparateRxTail parameter? On one hand, this parameter is more of an implementation detail compared to other NIC parameters, and I am concerned about explaining its meaning to external users. On the other hand, I do not see any advantages of enabling SeparateRxTail over disabling it. For hardware-implemented virtio device backends, disabling it improves throughput (as shown in the test results above). Even for non-hardware virtio device implementations, it does not seem to cause any side effects.
  2. When using bLargeSingleAllocation, can the memory allocated for the first block (which contains only the indirect area) be released earlier in CreateRxDescriptorOnInit? Since bLargeSingleAllocation uses direct mode, the indirect area is no longer needed.
  3. When using ParaNdis_InitialAllocatePhysicalMemory to allocate memory, the allocated size is not aligned. According to the MSDN documentation for NdisMAllocateSharedMemory, it seems to mention the need to use NdisMGetDmaAlignment to obtain the alignment boundary (I'm not sure if my understanding is correct).

@ybendito (Collaborator, Author) commented May 8, 2025

@ybendito I have a few additional questions and discussion points:

  1. Is it necessary to expose the Init.SeparateRxTail parameter? On one hand, this parameter is more of an implementation detail compared to other NIC parameters, and I am concerned about explaining its meaning to external users. On the other hand, I do not see any advantages of enabling SeparateRxTail over disabling it. For hardware-implemented virtio device backends, disabling it improves throughput (as shown in the test results above). Even for non-hardware virtio device implementations, it does not seem to cause any side effects.

We prefer to have a simple way to compare the performance of Enabled/Disabled in various scenarios on QEMU.
Note that in case the separate tail is disabled we create many NDIS-managed memory blocks that nobody uses (4K - tail).
The question of whether we really need the tail is very interesting. It is nontrivial to send a [64K + tail]-sized packet to a virtio-net device.

  2. When using bLargeSingleAllocation, can the memory allocated for the first block (which contains only the indirect area) be released earlier in CreateRxDescriptorOnInit? Since bLargeSingleAllocation uses direct mode, the indirect area is no longer needed.

Sure, it can be freed if not needed, but that is not mandatory in the context of this PR. An additional free operation will create some additional system load during initialization, probably increasing init time, especially in a scenario with several devices.

  3. When using ParaNdis_InitialAllocatePhysicalMemory to allocate memory, the allocated size is not aligned. According to the MSDN documentation for NdisMAllocateSharedMemory, it seems to mention the need to use NdisMGetDmaAlignment to obtain the alignment boundary (I'm not sure if my understanding is correct).

We need to check this. I have never seen unaligned blocks returned from NdisMAllocateSharedMemory.
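
For reference, a sketch of the alignment approach discussed above (the helper name is hypothetical and this is not the driver's code; NdisMGetDmaAlignment and NdisMAllocateSharedMemory are the NDIS APIs in question):

```c
#include <ndis.h>

/* Round a shared-memory allocation up to the DMA alignment boundary reported
   by NDIS (assumes the reported alignment is a power of two). Returns NULL on
   allocation failure. */
static PVOID AllocateAlignedSharedMemory(NDIS_HANDLE MiniportHandle,
                                         ULONG Length,
                                         PNDIS_PHYSICAL_ADDRESS PhysAddr)
{
    ULONG align  = NdisMGetDmaAlignment(MiniportHandle);
    ULONG padded = (Length + align - 1) & ~(align - 1);
    PVOID va     = NULL;

    NdisMAllocateSharedMemory(MiniportHandle, padded, TRUE /* cached */,
                              &va, PhysAddr);
    return va;
}
```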

@zjmletang (Member)

@ybendito I have a few additional questions and discussion points:

  1. Is it necessary to expose the Init.SeparateRxTail parameter? On one hand, this parameter is more of an implementation detail compared to other NIC parameters, and I am concerned about explaining its meaning to external users. On the other hand, I do not see any advantages of enabling SeparateRxTail over disabling it. For hardware-implemented virtio device backends, disabling it improves throughput (as shown in the test results above). Even for non-hardware virtio device implementations, it does not seem to cause any side effects.

We prefer to have a simple way to compare the performance of Enabled/Disabled in various scenarios on QEMU. Note that in case the separate tail is disabled we create many NDIS-managed memory blocks that nobody uses (4K - tail). The question of whether we really need the tail is very interesting. It is nontrivial to send a [64K + tail]-sized packet to a virtio-net device.

@ybendito Could you help me understand why the extra memory wasted here is 4K - tail (when disabled)? From my examination of the code, it looks like only memory of the size of the tail is allocated, and this extra memory is located in the first block. Additionally, considering the second point, since this extra memory can be freed, it seems that the advantage of saving memory might not actually exist.
[image: code excerpt referenced above]

  2. When using bLargeSingleAllocation, can the memory allocated for the first block (which contains only the indirect area) be released earlier in CreateRxDescriptorOnInit? Since bLargeSingleAllocation uses direct mode, the indirect area is no longer needed.

Sure, it can be freed if not needed, but that is not mandatory in the context of this PR. An additional free operation will create some additional system load during initialization, probably increasing init time, especially in a scenario with several devices.

@ybendito (Collaborator, Author) commented May 8, 2025

@zjmletang When we allocate 64K+30 bytes of shared memory we always get memory that starts at the beginning of a page; the remainder of the last page is theoretically free, but that page can be used by NDIS only to allocate small memory chunks. And typically small chunks are needed much less often and are rarely used.
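
The arithmetic behind "4K - tail", as a worked example (the 30 bytes below stands in for the tail size used in the allocation above):

```c
/* Worked example: a 64K + 30 byte shared allocation starts page-aligned. */
enum {
    PAGE_SZ            = 4096,
    ALLOC_SZ           = 64 * 1024 + 30,                     /* 65566 */
    PAGES_USED         = (ALLOC_SZ + PAGE_SZ - 1) / PAGE_SZ, /* 17    */
    LAST_PAGE_LEFTOVER = PAGES_USED * PAGE_SZ - ALLOC_SZ     /* 4066  */
};
/* The ~(4K - tail) leftover at the end of the last page can be reused by NDIS
   only for small allocations, which the driver rarely needs. */
```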

@zjmletang (Member) commented May 8, 2025

@zjmletang When we allocate 64K+30 bytes of shared memory we always get memory that starts at the beginning of a page; the remainder of the last page is theoretically free, but that page can be used by NDIS only to allocate small memory chunks. And typically small chunks are needed much less often and are rarely used.

@ybendito Thank you, this patch has so many benefits, and many hardware-implemented hypervisors will benefit from it. I have no other questions. cc @YanVugenfirer

@ybendito (Collaborator, Author) commented May 8, 2025

Tests:
Win11: unclear failure of DF - Reinstall with IO Before and After, passed on rerun
2022: unclear failure of PNP DIF Remove Device, passed on rerun
For random failures of the PnP test I've created https://issues.redhat.com/browse/RHEL-90180
I think we can merge it @YanVugenfirer

@YanVugenfirer merged commit df9c78c into virtio-win:master on May 10, 2025
7 of 9 checks passed
@ybendito deleted the RHEL-71813 branch on May 16, 2025 05:48