[RHEL-71813] Combine vnet header and data #1352
Conversation
Force-pushed the branch from 294a306 to 204eef0
@sb-ntnx this is pretty cool
Note that this does not change things for a virtual device, only for bare metal.
One failure across all runs: SingleEtherType on Win10_22H2, an unclear assertion and a BSOD; this is the same ndtest timeout on the bind operation that we saw twice before.
@zjmletang you are welcome to review
@zjmletang probably even try it and ack that this PR does the job and improves the throughput
@YanVugenfirer @ybendito Thank you both. I feel honored to have this learning opportunity. I carefully read through this PR, and I found one point that may not align with my original intent, which I would like to discuss. My initial intention for #1089 was to reduce the number of descriptors for RX packets by merging the virtio header and data. Under the right conditions (specifically, the ability to allocate about 64k of contiguous memory, which is feasible in most scenarios), ideally one RX packet should be represented by a single descriptor. This would significantly enhance DMA performance on the host side. The patch I submitted previously aimed to describe an RX packet using only one descriptor (at that time, the new RX layout mechanism had not yet been introduced). However, under the current RX layout mechanism (when RX_LAYOUT_AS_BEFORE is undefined by default), although we have merged the virtio header and data into one descriptor, the tail portion still occupies an additional descriptor. As a result, we are unable to achieve the goal of representing one RX packet with a single descriptor. In my last comment on #1089, I expressed concerns that our latest design might conflict with the goal of reducing descriptors (specifically due to the extra descriptor used for the tail).
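(For readers following the thread, a toy sketch of the three layouts being contrasted here; the names and the assumption of a physically contiguous data area are illustrative only, not the driver's actual structures.)

```cpp
// Hypothetical illustration of the RX layouts discussed in this thread;
// not NetKVM code, just the descriptor counts being argued about.
#include <cstdio>

enum class RxLayout
{
    SeparateHeader,             // vnet header descriptor + data descriptor + tail descriptor
    CombinedHeaderSeparateTail, // header merged with the data, tail still separate
    FullyContiguous             // header + data + tail in one contiguous block
};

// Ring descriptors the device has to fetch per RX packet under each layout,
// assuming the data area fits into a single physically contiguous block.
static unsigned DescriptorsPerPacket(RxLayout layout)
{
    switch (layout)
    {
    case RxLayout::SeparateHeader:
        return 3; // header, data, tail
    case RxLayout::CombinedHeaderSeparateTail:
        return 2; // header + data combined, tail separate
    case RxLayout::FullyContiguous:
        return 1; // the goal discussed in #1089
    }
    return 0;
}

int main()
{
    std::printf("separate header:         %u\n", DescriptorsPerPacket(RxLayout::SeparateHeader));
    std::printf("combined, separate tail: %u\n", DescriptorsPerPacket(RxLayout::CombinedHeaderSeparateTail));
    std::printf("fully contiguous:        %u\n", DescriptorsPerPacket(RxLayout::FullyContiguous));
}
```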
@zjmletang I understand your point. Indeed with current master the device should:
After my changes:
Your goal is:
@ybendito In fact, the efficiency of DMA on the host side can generally be influenced by the following two factors (please note that this may apply to multiple hardware platforms and is not specific to Alibaba Cloud):
With this in mind, I would like to discuss a potential modification (just a discussion, since the idea is not yet fully developed): when we can represent an RX packet with only one descriptor, could we consider making it a direct descriptor? This approach does not violate the virtio protocol, and it seems that Linux implements it this way (as shown in the image below). Such a change could not only save memory (by eliminating the need for the indirect memory portion) but also further enhance DMA efficiency on the host side.
@zjmletang For a single descriptor this already works that way, if I'm not mistaken, i.e. for a single descriptor the indirect table is not used:
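(The comment above presumably linked to the relevant VirtioLib code; as a hedged stand-in, here is a minimal sketch of the logic being referenced, with invented names — not the linked implementation.)

```cpp
// Illustrative only, not the actual VirtioLib implementation: an indirect
// descriptor table pays off only when a buffer consists of more than one
// SG entry; a single entry can be placed into the ring directly.
static bool ShouldUseIndirect(unsigned sgCount, bool indirectNegotiated)
{
    if (sgCount <= 1)
    {
        return false; // one entry fits into one ring descriptor as-is
    }
    // Multiple entries: an indirect table collapses them into a single
    // ring slot, provided VIRTIO_RING_F_INDIRECT_DESC was negotiated.
    return indirectNegotiated;
}
```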
@ybendito You're right; thank you for the reminder. Given this, does that mean we don't need to pre-allocate the indirect buffer for this type of single-descriptor situation? In other words, we wouldn't need to create a separate header page to store the indirect buffer and the tail, and we could just have the virtio header and data (including the tail) placed directly in a single descriptor.
@ybendito Exactly, my main goal is to have as few descriptors as possible per RX entry, ideally just one.
We can never know for sure that we'll be able to allocate a large buffer (16 pages and up). That is why we need the indirect area. Currently (in this PR) we allocate only 512 bytes for the indirect area + tail, not a full page.
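(A rough sketch of the fallback being described, with invented names and plain malloc standing in for the driver's DMA shared-memory allocation; the 512-byte figure is taken from the comment above.)

```cpp
// Hypothetical sketch, not the driver's code: try the large contiguous block
// first, but always keep the small indirect + tail area for the fallback case.
#include <cstdlib>

static const size_t kLargeBlock      = 64 * 1024; // contiguous RX data area (16 pages)
static const size_t kIndirectAndTail = 512;       // indirect table + tail, per this PR

struct RxBuffer
{
    void *dataBlock;        // large contiguous block; may be nullptr
    void *indirectAndTail;  // always present, used when the chain is not contiguous
    bool  singleDescriptor; // true when the whole packet fits one descriptor
};

static bool AllocateRxBuffer(RxBuffer &buf)
{
    // We can never be sure that a 16+ page contiguous allocation succeeds,
    // so the indirect area is allocated unconditionally.
    buf.indirectAndTail = std::malloc(kIndirectAndTail);
    if (buf.indirectAndTail == nullptr)
    {
        return false;
    }

    buf.dataBlock = std::malloc(kLargeBlock);
    buf.singleDescriptor = (buf.dataBlock != nullptr);
    // On failure the driver would fall back to smaller allocations chained
    // through the indirect table (omitted in this sketch).
    return true;
}
```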
Fixes: 8860475 (netkvm: use optimized layout of shared memories for RX) On DMA memory allocation failure the driver does not change the amount of allocated memory. So if the large allocation fails, the RX descriptor will not be created at all. Signed-off-by: Yuri Benditovich <[email protected]>
Add new fields for flexibility of:
- page number where we have the virtio header (currently 0)
- offset of the data start in page 1 (currently 0)
- index of the page number (currently there is no difference between the page index and the index in the SG array)
Currently these fields are not in use. In the future, in case the header and data are combined: the header page will be 1, the data start offset will be the size of the vnet header, and the first page will not be used by the SG; it will contain only the indirect area and the tail of the data. Signed-off-by: Yuri Benditovich <[email protected]>
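(A hedged illustration of what these fields might look like; the structure and field names are invented for readability, not the actual NetKVM definitions.)

```cpp
// Hypothetical sketch of the per-RX-buffer layout fields described in this
// commit; names are illustrative only.
struct RxLayoutFields
{
    unsigned headerPageIndex;  // page holding the virtio header (currently 0,
                               // becomes 1 when the header and data are combined)
    unsigned dataStartOffset;  // offset of the data start in page 1 (currently 0,
                               // becomes the vnet header size when combined)
    unsigned firstSgPageIndex; // index of the first page used by the SG array
                               // (currently equal to the page index)
};
```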
First, this is not needed. The exact size of the payload is defined in the NB, and the MDL chain may be longer. Second, this complicates the configuration of the first MDL when we need to combine the vnet header and data. If we change the length of the MDL we need to restore it upon deletion, so we need to save the initial length (it might differ from the size of the respective page if the data and header are combined). Signed-off-by: Yuri Benditovich <[email protected]>
Currently nothing is changed and RX initialization works exactly as before, but the code is ready to support the case where we do not reserve any separate space for the vnet header and instead merge it with the packet data. Signed-off-by: Yuri Benditovich <[email protected]>
Jira link: https://issues.redhat.com/browse/RHEL-71813 If the device is V1 or offers VIRTIO_F_ANY_LAYOUT, it makes no assumptions regarding the framing of packets and does not require a separate descriptor for the vnet header. On a bare metal device this reduces the number of DMA transactions in the RX path. To get back to the previous behavior it is enough to revert this commit or just set combineHeaderAndData to false always. Signed-off-by: Yuri Benditovich <[email protected]>
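(A minimal sketch of the condition this commit describes; the function name is illustrative — only combineHeaderAndData is mentioned in the commit itself.)

```cpp
// Illustrative only: the header can be combined with the data when the device
// makes no framing assumptions, i.e. it is a V1 device or offers
// VIRTIO_F_ANY_LAYOUT.
static bool CanCombineHeaderAndData(bool isV1Device, bool anyLayoutOffered)
{
    return isV1Device || anyLayoutOffered;
}
```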
Force-pushed the branch from 6615b95 to afcae2b
Should be defined as 0 or 1; for RHEL we define 1. This will be used in the configuration of RX memory allocation. Value=0 will probably be better for a bare-metal implementation. Signed-off-by: Yuri Benditovich <[email protected]>
Jira link: https://issues.redhat.com/browse/RHEL-71813 Fixes: virtio-win#1078 If "Init.SeparateRxTail" is set to 0, the driver will try to allocate a contiguous memory block for each RX buffer. If this succeeds, the virtio descriptor will contain a single entry and the device will not spend time retrieving the indirect chain. For a bare metal device this might be better. We leave the default value "1", so by default the driver will place the tail in a separate memory block (where we keep the indirect descriptors). Signed-off-by: Yuri Benditovich <[email protected]>
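(Putting the last few commits together, a hedged sketch of how the per-buffer decision might combine; the parameter names are assumptions, not the driver's identifiers.)

```cpp
// Hypothetical combination of the pieces described above, not the actual
// driver code: a single-descriptor RX buffer needs all three conditions.
static bool UseSingleRxDescriptor(bool combineHeaderAndData,    // device is V1 or offers VIRTIO_F_ANY_LAYOUT
                                  bool separateRxTail,          // "Init.SeparateRxTail" keyword, default 1
                                  bool contiguousAllocationOk)  // the large shared-memory block was allocated
{
    return combineHeaderAndData && !separateRxTail && contiguousAllocationOk;
}
```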
This adds a possibility to simulate a failure to allocate large blocks: if defined, only blocks of one page or less will be returned. Signed-off-by: Yuri Benditovich <[email protected]>
@zjmletang Please try this one with the configuration "Init.SeparateRxTail"=Disable
@ybendito Great! My leave ends tomorrow; I will test it as soon as possible and provide feedback to you.
@zjmletang If you do your own build, note the commit with INX_NETKVM_SEPARATE_TAIL and place it in your vendor file; you'll probably want it set to 0.
Tests:
@ybendito Test Results Analysis:
@zjmletang Thank you for sharing the data. If possible, please briefly describe the testing conditions:
@ybendito |
@ybendito I have a few additional questions and discussion points:
We prefer to have a simple way to compare the performance of Enabled/Disabled in various scenarios on QEMU.
Sure, it can be freed if not needed, but that is not mandatory in the context of this PR. An additional free operation will create some extra system load during initialization, probably increasing init time, especially in a scenario with several devices.
We need to check this. I have never seen unaligned blocks returned from NdisMAllocateSharedMemory.
@ybendito Could you help me understand why the extra memory wasted here is 4k - tail (when disabled)? From my examination of the code, it looks like only memory of the size of the tail is allocated, and this extra memory is located in the first block. Additionally, considering the second point, since this extra memory can be freed, it seems that the advantage of saving memory might not actually exist.
@zjmletang When we allocate 64K+30 bytes of shared memory, we always get memory that starts at the beginning of a page. The remainder of the last page is theoretically free, but NDIS can use this page only to allocate small memory chunks, and typically small chunks are needed much less often and are rarely used.
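(A small worked example of that arithmetic; the 4K page size and the 30-byte tail are taken from the comment above, the rest is illustrative.)

```cpp
#include <cstdio>

int main()
{
    const unsigned pageSize = 4096;
    const unsigned dataSize = 64 * 1024; // RX data area (16 pages)
    const unsigned tailSize = 30;        // tail size used in the comment above

    // Shared memory allocations start on a page boundary, so the block spans
    // whole pages and the remainder of the last page is usable by NDIS only
    // for small allocations.
    const unsigned allocated = ((dataSize + tailSize + pageSize - 1) / pageSize) * pageSize;
    const unsigned wasted    = allocated - (dataSize + tailSize);

    std::printf("allocated %u bytes, %u bytes left over in the last page (4K - tail)\n",
                allocated, wasted);
    return 0;
}
```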
@ybendito Thank you, this patch has so many benefits, and many hardware-implemented hypervisors will benefit from it. I have no other questions. cc @YanVugenfirer
Tests: Win11: |