
[RHEL-71813] Combine vnet header and data #1352


Merged
8 commits merged into virtio-win:master from RHEL-71813 on May 10, 2025

Conversation

ybendito (Collaborator)

No description provided.

@JonKohler (Contributor)

@sb-ntnx this is pretty cool

@ybendito (Collaborator, Author)

@sb-ntnx this is pretty cool

Note that this does not change things for virtual devices, only for bare metal.

@ybendito (Collaborator, Author) commented Apr 26, 2025

One failure across all runs: SingleEtherType on Win10_22H2, an unclear assertion and BSOD; this is the same ndtest timeout on the bind operation that we saw twice before.
On rerun everything passed as expected.

@YanVugenfirer (Collaborator)

@zjmletang you are welcome to review

@YanVugenfirer requested a review from zjmletang on April 27, 2025 12:49
@ybendito (Collaborator, Author)

@zjmletang you are welcome to review

@zjmletang you could probably even try it and ack that this PR does the job and improves the throughput.

@zjmletang (Member)

@YanVugenfirer @ybendito Thank you both. I feel honored to have this learning opportunity. I carefully read through this PR, and I found a point that may not align with my thoughts, which I would like to discuss.

My initial intention for #1089 was to reduce the number of descriptors for RX packets by merging the virtio header and data. Under the right conditions (specifically, the ability to allocate about 64K of contiguous memory, which is feasible in most scenarios), one RX packet should ideally be represented by a single descriptor. This would significantly enhance DMA performance on the host side. The patch I submitted previously aimed to describe an RX packet using only one descriptor (at that time, the new RX layout mechanism had not yet been introduced). However, under the current RX layout mechanism (with RX_LAYOUT_AS_BEFORE undefined by default), although we have merged the virtio header and data into one descriptor, the tail portion still occupies an additional descriptor. As a result, we are unable to achieve the goal of representing one RX packet with a single descriptor. In my last comment on #1089, I expressed concerns that our latest design might conflict with the goal of reducing descriptors (specifically due to the extra descriptor used for the tail).
Thank you for considering my thoughts, and I look forward to your insights!

@ybendito (Collaborator, Author) commented Apr 28, 2025

@zjmletang I understand your point. Indeed, with the current master the device has to:

  1. Download entry descriptor
  2. Download descriptors for header and payload
  3. Push header to RAM
  4. Push payload to RAM
  5. Push total size to entry descriptor
  6. Push entire entry to queue

After my changes:

  1. Download entry descriptor
  2. Download descriptors for payload+header and tail
  3. Push payload+header to RAM
  4. Push total size to entry descriptor
  5. Push entire entry to queue

Your goal is:

  1. Download entry descriptor
  2. Push payload+header to RAM
  3. Push total size to entry descriptor
  4. Push entire entry to queue

Is that correct?
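
For illustration, a minimal C sketch of the three scatter-gather layouts compared above (all structure names, addresses, and sizes below are made up; this is not the NetKVM code):

```c
#include <stdint.h>

/* Illustrative SG entry and numbers; not the NetKVM structures. */
typedef struct { uint64_t phys; uint32_t len; } sg_entry;

#define VNET_HDR 12u      /* assumed vnet header size        */
#define DATA_LEN 65536u   /* assumed contiguous RX data area */
#define TAIL_LEN 240u     /* assumed tail size, arbitrary    */

/* Current master: header, data and tail are separate entries (3 fragments). */
static const sg_entry rx_master[] = {
    { 0x10000u, VNET_HDR },
    { 0x20000u, DATA_LEN },
    { 0x10200u, TAIL_LEN },
};

/* This PR: header merged into the first data page; the tail stays in the
   page that also holds the indirect area (2 fragments). */
static const sg_entry rx_combined[] = {
    { 0x20000u, VNET_HDR + DATA_LEN },
    { 0x10200u, TAIL_LEN },
};

/* The stated goal (one large contiguous allocation): a single direct
   descriptor covering header, data and tail (1 fragment). */
static const sg_entry rx_single[] = {
    { 0x20000u, VNET_HDR + DATA_LEN + TAIL_LEN },
};
```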

@zjmletang (Member)

@ybendito In fact, the efficiency of DMA on the host side can generally be influenced by the following two factors (please note that this may apply to multiple hardware platforms and is not specific to Alibaba Cloud):

  1. The number of descriptors: fewer is better (as mentioned earlier).
  2. Whether the descriptors are direct or indirect: direct descriptors are preferable to indirect ones, as indirect descriptors require an additional layer of parsing on the host side.

With this in mind, I would like to discuss a potential modification (just a discussion, since the idea may not be fully developed yet): when we can represent an RX packet with only one descriptor, could we consider changing it to a direct descriptor? This approach does not violate the virtio protocol, and it seems that Linux implements it this way (as shown in the image below). Such a change could not only save memory (by eliminating the need for the indirect memory portion) but also further enhance DMA efficiency on the host side.
[image: Linux implementation referenced above]

@ybendito (Collaborator, Author) commented Apr 28, 2025

@zjmletang For a single descriptor this already works that way, if I'm not mistaken, i.e. for a single descriptor the indirect is not used:
https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/VirtIO/VirtIORing.c#L239
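
Roughly, the check there amounts to the following (a paraphrased sketch, not the exact VirtIORing.c code):

```c
#include <stdbool.h>
#include <stddef.h>

/* Paraphrased sketch of the split-ring "add buffer" decision: the indirect
   table is used only when an indirect area was supplied AND more than one
   buffer entry has to be added; a single-entry request goes into a direct
   descriptor, so the pre-allocated indirect area is simply not touched. */
static bool use_indirect_descriptors(const void *va_indirect,
                                     unsigned int out, unsigned int in)
{
    return va_indirect != NULL && (out + in) > 1;
}
```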

@zjmletang (Member)

@zjmletang For a single descriptor this already works that way, if I'm not mistaken, i.e. for a single descriptor the indirect is not used: https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/VirtIO/VirtIORing.c#L239

@ybendito You're right; thank you for the reminder. Given this, does that mean we don't need to pre-allocate the indirect buffer for this type of single descriptor situation? In other words, we wouldn't need to create a separate header page to store the indirect buffer and tail, and we could just have the virtio header and data (including the tail) placed directly in a single descriptor.

@zjmletang (Member)

@zjmletang I understand your point. Indeed, with the current master the device has to:

  1. Download entry descriptor
  2. Download descriptors for header and payload
  3. Push header to RAM
  4. Push payload to RAM
  5. Push total size to entry descriptor
  6. Push entire entry to queue

After my changes:

  1. Download entry descriptor
  2. Download descriptors for payload+header and tail
  3. Push payload+header to RAM
  4. Push total size to entry descriptor
  5. Push entire entry to queue

Your goal is:

  1. Download entry descriptor
  2. Push payload+header to RAM
  3. Push total size to entry descriptor
  4. Push entire entry to queue

Is that correct?

@ybendito Exactly, my main goal is to have as few descriptors as possible included in the entry descriptor, ideally just one.

@ybendito (Collaborator, Author)

@zjmletang For a single descriptor this already works that way, if I'm not mistaken, i.e. for a single descriptor the indirect is not used: https://github.com/virtio-win/kvm-guest-drivers-windows/blob/master/VirtIO/VirtIORing.c#L239

@ybendito You're right; thank you for the reminder. Given this, does that mean we don't need to pre-allocate the indirect buffer for this type of single descriptor situation? In other words, we wouldn't need to create a separate header page to store the indirect buffer and tail, and we could just have the virtio header and data (including the tail) placed directly in a single descriptor.

We can never know for sure that we'll be able to allocate a large buffer (16 pages and up). For that case we need the indirect area. Currently (with this PR) we allocate only 512 bytes for the indirect area + tail, not a full page.
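
As a rough back-of-the-envelope check of that 512-byte block (the split below is an assumption for illustration, not taken from the code):

```c
/* Assumed split of the 512-byte block, for illustration only. */
enum {
    VRING_DESC_SIZE    = 16,                                   /* addr+len+flags+next */
    MAX_INDIRECT_DESCS = 17,                                   /* ~64K of 4K pages    */
    INDIRECT_AREA      = VRING_DESC_SIZE * MAX_INDIRECT_DESCS, /* 272 bytes           */
    SHARED_BLOCK       = 512,
    TAIL_SPACE         = SHARED_BLOCK - INDIRECT_AREA          /* 240 bytes for tail  */
};
```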

ybendito added 5 commits May 3, 2025 00:35
Fixes: 8860475 (netkvm: use optimized layout of shared memories for RX)

On DMA memory allocation failure the driver does not change
the amount of allocated memory. So if the large allocation
fails, the RX descriptor will not be created at all.

Signed-off-by: Yuri Benditovich <[email protected]>
Add new fields for flexibility:
- page number where we have a virtio header (currently 0)
- offset of data start in page 1 (currently 0)
- index of page number (currently there is no difference
  between the page index and the index in the SG array)

Currently these fields are not in use. In the future, when
the header and data are combined:
- the header page will be 1
- the data start offset will be the size of the vnet header
- the first page will not be used by the SG; it will contain
  only the indirect area and the tail of the data.

Signed-off-by: Yuri Benditovich <[email protected]>
First, this is not needed. The exact size of the payload
is defined in the NB and the MDL chain may be longer.
Second, this complicates the configuration of the first
MDL when we need to combine the vnet header and data.
If we change the length of the MDL we need to restore it upon
deletion, so we would need to save the initial length (it might
be different from the size of the respective page if the data
and header are combined).

Signed-off-by: Yuri Benditovich <[email protected]>
Currently nothing is changed and the RX initialization works
exactly as before, but the code is ready to support the case
where we do not reserve any separate space for the vnet header
and instead merge it with the packet data.

Signed-off-by: Yuri Benditovich <[email protected]>
Jira link: https://issues.redhat.com/browse/RHEL-71813

If the device is V1 or offers VIRTIO_F_ANY_LAYOUT, it does
not make any assumptions regarding the framing of packets and
does not require a separate descriptor for the vnet
header. On a bare metal device this reduces the number of DMA
transactions in the RX path.
In order to get back to the previous behavior it is enough to
revert this commit or just set combineHeaderAndData to
false always.

Signed-off-by: Yuri Benditovich <[email protected]>
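
A minimal sketch of the gating this commit message describes (the function and variable names here are illustrative, not the commit's code; the commit's flag is combineHeaderAndData, and the feature bit values are taken from the virtio specification):

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers per the virtio specification. */
#define VIRTIO_F_ANY_LAYOUT 27
#define VIRTIO_F_VERSION_1  32

/* With either feature the device makes no framing assumptions, so the vnet
   header does not need a descriptor of its own. */
static bool can_combine_header_and_data(uint64_t host_features)
{
    return (host_features & (1ULL << VIRTIO_F_VERSION_1)) != 0 ||
           (host_features & (1ULL << VIRTIO_F_ANY_LAYOUT)) != 0;
}
```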
@ybendito force-pushed the RHEL-71813 branch 5 times, most recently from 6615b95 to afcae2b on May 3, 2025 16:25
Should be defined as 0 or 1; for RHEL we define 1.
This will be used in the configuration of RX memory allocation.
Value=0 will probably be better for a bare-metal implementation.

Signed-off-by: Yuri Benditovich <[email protected]>
Jira link: https://issues.redhat.com/browse/RHEL-71813
Fixes: virtio-win#1078

If "Init.SeparateRxTail" is set to 0 the driver will try to
allocate contiguous memory block for each RX buffer. If it
succeeds, the virtio descriptor will contain single entry
and the device will not spend time retrieving the indirect
chain. For the bare metail device this might be better.
We leave the default value "1" and by default the driver
will pace the tail in separate memory block (where we
keep indirect descriptors).

Signed-off-by: Yuri Benditovich <[email protected]>
This adds a possibility to simulate a failure to allocate
large blocks. If defined, only blocks of one page and smaller
will be returned.

Signed-off-by: Yuri Benditovich <[email protected]>
@ybendito (Collaborator, Author) commented May 4, 2025

@zjmletang Please try this one with configuration "Init.SeparateRxTail"=Disable

@zjmletang (Member)

@zjmletang Please try this one with configuration "Init.SeparateRxTail"=Disable

@ybendito great! My leave ends tomorrow, and I will test it as soon as possible and provide feedback to you.

@ybendito (Collaborator, Author) commented May 6, 2025

@zjmletang If you do your own build, note the commit with INX_NETKVM_SEPARATE_TAIL; place it in your vendor file, and you'll probably want it = 0.

@ybendito (Collaborator, Author) commented May 7, 2025

Tests:
Win11: unclear failure of DF - Reinstall with IO Before and After, passed on rerun
2022: unclear failure of PNP DIF Remove Device, passed on rerun

@zjmletang (Member)

@zjmletang Please try this one with configuration "Init.SeparateRxTail"=Disable


@ybendito
The following are our test results: This patch significantly improves throughput on the host side. We validated it on two internal hypervisors, both of which implement the virtio net backend in hardware.

[image: throughput test results]

Test Results Analysis:
From the host's theoretical perspective, the mechanism optimized by the patch (Init.SeparateRxTail = Enable) reduces the number of fragments from 3 to 2 compared to the master branch. Most data resides in the first fragment, with the second fragment rarely used, thereby reducing DMA requests. When Init.SeparateRxTail = Disable, the primary benefit lies in switching from indirect to direct mode, eliminating the need to parse entry descriptors, which further enhances throughput. The above information is for reference.

@ybendito (Collaborator, Author) commented May 8, 2025

@zjmletang Thank you for sharing the data. If possible, please briefly describe the testing conditions:

  • typical packet size
  • kind of transfer (tcp/udp/raw/-nagle etc)

@zjmletang (Member) commented May 8, 2025

@zjmletang Thank you for sharing the data. If possible, please briefly describe the testing conditions:

  • typical packet size
  • kind of transfer (tcp/udp/raw/-nagle etc)

@ybendito
packet size: 1400
kind of transfer: udp

@zjmletang (Member)

@ybendito I have a few additional questions and discussion points:

  1. Is it necessary to expose the Init.SeparateRxTail parameter? On one hand, this parameter is more of an implementation detail compared to other NIC parameters, and I am concerned about explaining its meaning to external users. On the other hand, I do not see any advantages of enabling SeparateRxTail over disabling it. For hardware-implemented virtio device backends, disabling it improves throughput (as shown in the test results above). Even for non-hardware virtio device implementations, it does not seem to cause any side effects.
  2. When using bLargeSingleAllocation, can the memory allocated for the first block (which contains only the indirect area) be released earlier in CreateRxDescriptorOnInit? Since bLargeSingleAllocation uses direct mode, the indirect area is no longer needed.
  3. When using ParaNdis_InitialAllocatePhysicalMemory to allocate memory, the allocated size is not aligned. According to the MSDN documentation for NdisMAllocateSharedMemory, it seems to mention the need to use NdisMGetDmaAlignment to obtain the alignment boundary (I'm not sure if my understanding is correct).

@ybendito (Collaborator, Author) commented May 8, 2025

@ybendito I have a few additional questions and discussion points:

  1. Is it necessary to expose the Init.SeparateRxTail parameter? On one hand, this parameter is more of an implementation detail compared to other NIC parameters, and I am concerned about explaining its meaning to external users. On the other hand, I do not see any advantages of enabling SeparateRxTail over disabling it. For hardware-implemented virtio device backends, disabling it improves throughput (as shown in the test results above). Even for non-hardware virtio device implementations, it does not seem to cause any side effects.

We prefer to have a simple way to compare the performance of Enabled/Disabled in various scenarios on QEMU.
Note that in case the separate tail is disabled we create many NDIS-managed memory blocks that nobody uses (4K - tail).
The question of whether we really need the tail is very interesting. It is nontrivial to send a [64K + tail]-sized packet to a virtio-net device.

  2. When using bLargeSingleAllocation, can the memory allocated for the first block (which contains only the indirect area) be released earlier in CreateRxDescriptorOnInit? Since bLargeSingleAllocation uses direct mode, the indirect area is no longer needed.

Sure, it can be freed if not needed, but that is not mandatory in the context of this PR. An additional free operation will create some additional system load during initialization, probably increasing init time, especially in a scenario with several devices.

  3. When using ParaNdis_InitialAllocatePhysicalMemory to allocate memory, the allocated size is not aligned. According to the MSDN documentation for NdisMAllocateSharedMemory, it seems to mention the need to use NdisMGetDmaAlignment to obtain the alignment boundary (I'm not sure if my understanding is correct).

We need to check this. I have never seen unaligned blocks returned from NdisMAllocateSharedMemory.
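
For reference, a sketch of the alignment approach discussed above (the helper name is hypothetical and this is not the driver's code; NdisMGetDmaAlignment and NdisMAllocateSharedMemory are the NDIS APIs in question):

```c
#include <ndis.h>

/* Round a shared-memory allocation up to the DMA alignment boundary reported
   by NDIS (assumes the reported alignment is a power of two). Returns NULL on
   allocation failure. */
static PVOID AllocateAlignedSharedMemory(NDIS_HANDLE MiniportHandle,
                                         ULONG Length,
                                         PNDIS_PHYSICAL_ADDRESS PhysAddr)
{
    ULONG align  = NdisMGetDmaAlignment(MiniportHandle);
    ULONG padded = (Length + align - 1) & ~(align - 1);
    PVOID va     = NULL;

    NdisMAllocateSharedMemory(MiniportHandle, padded, TRUE /* cached */,
                              &va, PhysAddr);
    return va;
}
```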

@zjmletang (Member)

@ybendito I have a few additional questions and discussion points:

  1. Is it necessary to expose the Init.SeparateRxTail parameter? On one hand, this parameter is more of an implementation detail compared to other NIC parameters, and I am concerned about explaining its meaning to external users. On the other hand, I do not see any advantages of enabling SeparateRxTail over disabling it. For hardware-implemented virtio device backends, disabling it improves throughput (as shown in the test results above). Even for non-hardware virtio device implementations, it does not seem to cause any side effects.

We prefer to have a simple way to compare the performance of Enabled/Disabled in various scenarios on QEMU. Note that in case the separate tail is disabled we create many NDIS-managed memory blocks that nobody uses (4K - tail). The question of whether we really need the tail is very interesting. It is nontrivial to send a [64K + tail]-sized packet to a virtio-net device.

@ybendito Could you help me understand why the extra memory wasted here is 4K - tail (when disabled)? From my examination of the code, it looks like only memory of the size of the tail is allocated, and this extra memory is located in the first block. Additionally, considering the second point, since this extra memory can be freed, it seems that the advantage of saving memory might not actually exist.
[image: code excerpt referenced above]

  2. When using bLargeSingleAllocation, can the memory allocated for the first block (which contains only the indirect area) be released earlier in CreateRxDescriptorOnInit? Since bLargeSingleAllocation uses direct mode, the indirect area is no longer needed.

Sure, it can be freed if not needed, but that is not mandatory in the context of this PR. An additional free operation will create some additional system load during initialization, probably increasing init time, especially in a scenario with several devices.

@ybendito (Collaborator, Author) commented May 8, 2025

@zjmletang When we allocate 64K+30 bytes of shared memory we always get memory that starts at the beginning of a page; the remainder of the last page is theoretically free, but that page can be used by NDIS only to allocate small memory chunks. And typically small chunks are needed much less often and are rarely used.
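
The arithmetic behind "4K - tail", as a worked example (the 30 bytes below stands in for the tail size used in the allocation above):

```c
/* Worked example: a 64K + 30 byte shared allocation starts page-aligned. */
enum {
    PAGE_SZ            = 4096,
    ALLOC_SZ           = 64 * 1024 + 30,                     /* 65566 */
    PAGES_USED         = (ALLOC_SZ + PAGE_SZ - 1) / PAGE_SZ, /* 17    */
    LAST_PAGE_LEFTOVER = PAGES_USED * PAGE_SZ - ALLOC_SZ     /* 4066  */
};
/* The ~(4K - tail) leftover at the end of the last page can be reused by NDIS
   only for small allocations, which the driver rarely needs. */
```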

@zjmletang (Member) commented May 8, 2025

@zjmletang When we allocate 64K+30 bytes of shared memory we always get memory that starts at the beginning of a page; the remainder of the last page is theoretically free, but that page can be used by NDIS only to allocate small memory chunks. And typically small chunks are needed much less often and are rarely used.

@ybendito Thank you, this patch has so many benefits, and many hardware-implemented hypervisors will benefit from it. I have no other questions. cc @YanVugenfirer

@ybendito (Collaborator, Author) commented May 8, 2025

Tests:
Win11: unclear failure of DF - Reinstall with IO Before and After, passed on rerun
2022: unclear failure of PNP DIF Remove Device, passed on rerun
For random failures of the PnP test I've created https://issues.redhat.com/browse/RHEL-90180
I think we can merge it @YanVugenfirer

@YanVugenfirer merged commit df9c78c into virtio-win:master on May 10, 2025
7 of 9 checks passed
@ybendito deleted the RHEL-71813 branch on May 16, 2025 05:48