Conversation

@rtamalin (Collaborator) commented Nov 11, 2025

Define two new tables:

  • profiles: This table stores the complete profiles that have been provided by clients as part of the system registration (announce_system) or keepalive (update) request handling. It consists of three fields: product_type, identifier and data.
  • system_profiles: This table links system records to their associated profiles in a one-to-many relationship, allowing the same profile records to be shared between multiple systems (a migration sketch follows this list).
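
A minimal migration sketch for these two tables, assuming standard Rails conventions (column types, null constraints and the migration class name are assumptions, not taken from this PR):

  class CreateProfileTables < ActiveRecord::Migration[6.1]
    def change
      create_table :profiles do |t|
        t.string :product_type, null: false
        t.string :identifier, null: false
        t.text :data
        t.timestamps
      end
      # A profile is uniquely identified by (product_type, identifier)
      add_index :profiles, %i[product_type identifier], unique: true

      create_table :system_profiles do |t|
        t.references :system, null: false
        t.references :profile, null: false
        t.timestamps
      end
    end
  end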

Add support for handling any provided profiles to the handlers for the announce_system and update requests in the connect V3 API.

This support involves checking that the provided profiles are valid and, when provided for the first time, complete. New profiles are added to the profiles table, and appropriate references are added or updated via the system_profiles table to associate systems with their corresponding profiles.

Similarly, when incomplete profiles are included in update requests, the corresponding complete profiles will be retrieved and used, with any missing profiles being dropped and considered problematic.

Additionally, when problematic profiles are detected, whether invalid or incomplete, they will be ignored and the X-System-Profiles-Action header will be included in the response, with a value of clear-cache.

In the System model, a new custom attribute, complete_profiles, with an accompanying assignment method, complete_profiles=, handles the assignment of complete profiles to be associated with a system record as part of record creation or update.
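
A rough sketch of the shape of that attribute, assuming a has_many :through association (only the attribute and method names come from this PR; the body is illustrative):

  class System < ApplicationRecord
    has_many :system_profiles, dependent: :destroy
    has_many :profiles, through: :system_profiles

    attr_reader :complete_profiles

    # Assigning complete profiles (re)links the system to the given
    # profile records via the system_profiles join table
    def complete_profiles=(profiles)
      @complete_profiles = profiles
      self.profiles = profiles
    end
  end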

If racing requests attempt to create the same new profile, one will succeed and the others will rescue the ActiveRecord::RecordNotUnique exception and instead look up the newly created profile record.
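
A common pattern for handling that race, sketched here on the Profile model with a hypothetical helper name (the PR's actual method may differ):

  # Hypothetical helper; create! raises ActiveRecord::RecordNotUnique
  # when the (product_type, identifier) unique index is violated
  def self.find_or_create_profile!(product_type, identifier, data)
    create!(product_type: product_type, identifier: identifier, data: data)
  rescue ActiveRecord::RecordNotUnique
    # Another request won the race; look up the record it created
    find_by!(product_type: product_type, identifier: identifier)
  end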

Add support for optimizing the content of the send_bulk_system_update() request to only include the full profile, with both identifier and data fields, on the first occurrence of a profile within the list of serialized systems; subsequent occurrences of that profile drop the data field.

Update the test cases to validate correct operation of the new models and the associated request handling changes.

Related Jira: TEL-265

@rtamalin (Collaborator Author) commented:

FYI, while doing some heavy stress testing I'm seeing some DB stall issues.

@rtamalin (Collaborator Author) commented:

Investigating; the issue may be related to recent changes on the master branch.

@rtamalin (Collaborator Author) commented Nov 12, 2025

Yup, having reverted my test env to be based upon the current master branch I still see the same issue... Digging further.

@rjschwei (Member) left a comment:

Comments should start with uppercase to stay consistent throughout the code base

  # check if any profiles have been provided
  if params.key?(:system_profiles)
    profiles = info_params(:system_profiles)[:system_profiles]
    complete, incomplete, invalid = Profile.filter_profiles(profiles.to_h)
Member commented:

Should we take a more fine grained approach here? At present we are looking at 2 profiles, pci data and loaded kernel module data; if one of these is incomplete or invalid we have a 50/50 chance to guess which one the client provided incorrectly. When we add more data our chances to guess correctly go down. I would suggest that we loop through the system profiles at this level and then send each profile into the next level down. That way we can pick out which profile may be "broken" and can log an appropriate error. We might even go so far as to relay that information to the client.

@rtamalin (Collaborator Author) replied:

Per our design, the system_profiles entry in the request JSON payload will be a JSON object with the following structure:

  ...

  "system_profiles": {
    "<profile_type>": {
      "identifier": "<profile_identifier_string>",
      "data": "<profile_data_string>"
    },
    ...
  },

  ...
 

There is no guessing here - the combination of <profile_type> and identifier value is what identifies a specific profile, not just the profile identifier value itself.

This approach insulates us from any risks associated with data blobs from different profile_types ever having the same identifier value, e.g. because the hash of their content ends up being the same value; each data blob will be stored independently, without any risk of corrupting the data blob associated with a different profile type.

This approach also allows us to introduce new profile types in the future that may use a different identifier generating approach, if desired, without similarly worrying that it could result in overwriting the content associated with a different profile type's profile entry.

If a profile is missing the data field it is considered incomplete, and if it is missing the identifier field it is considered invalid.

For an announce_system request, as part of system registration, only complete profiles are acceptable, per our design, so we only pass on complete profiles to the create!() method and filter out the incomplete or invalid profiles, additionally setting the X-System-Profiles-Action response header to clear-cache if any incomplete or invalid profiles are found, to indicate to the client that it should clear its cache and send full profiles next time.

Note, though, that per our design and proposed suseconnect(-ng) implementation, the client should always be sending full profiles anyway for an announce_system request as part of a system registration.

For an update request, as part of the system keepalive notification, the expected optimization is that clients will send incomplete profiles, with only the identifier provided, so we will allow incomplete profiles in that instance, but only those that are already "known", i.e. ones for which we already have a matching profile stored in the profiles table; any "unknown" incomplete profiles are skipped and trigger the header to be set in the response.
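
Putting the categorization rules together, a sketch of the filtering logic described above (the real Profile.filter_profiles() implementation may differ in detail):

  def self.filter_profiles(profiles)
    complete, incomplete, invalid = {}, {}, {}
    profiles.each do |profile_type, profile|
      if profile[:identifier].blank?
        invalid[profile_type] = profile     # Missing identifier: invalid
      elsif profile.key?(:data)
        complete[profile_type] = profile    # Identifier and data (even ""): complete
      else
        incomplete[profile_type] = profile  # Identifier only: incomplete
      end
    end
    [complete, incomplete, invalid]
  end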

Member replied:

Well, the log message just shows the count, so there is a lot of guessing when I read "problematic profiles detected: 2 incomplete" once we have more than 2 profiles. And from the naming it is not obvious that complete is a collection of strings where the strings represent the profile_type.

@rtamalin (Collaborator Author) replied:

I can enhance the relevant debug messages to also report the profile types for each problematic category, and to expand the message content to reflect the nature of what each category is, e.g. "missing data" or "missing identifier", in addition to the code comment explanation of these.
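
For example (a hypothetical message format, not the PR's exact wording):

  logger.debug(
    'problematic profiles detected: ' \
    "incomplete (missing data): #{incomplete.keys.join(', ')}; " \
    "invalid (missing identifier): #{invalid.keys.join(', ')}"
  )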

@rtamalin (Collaborator Author) replied:

I've enhanced the debug messages to include the profile types for the problematic categories, as well as report the profile types for valid profiles being added/updated.

@rtamalin (Collaborator Author) commented:

Determined that the issue was due to a recent upgrade of containerd.io on my system to 2.1.5, which has a very low default soft file limit, leading to problems when there were lots of active connections...

@rtamalin rtamalin marked this pull request as ready for review November 13, 2025 16:18
Capitalize comment sentence starting words.

Add extra comments to clarify how profiles are categorized, in detail
in the Profile.filter_profiles() method, more briefly in the handlers
for the announce_system and update requests.

Rename identify_existing_profiles() to identify_known_profiles() for
improved clarity of what the method is intended to do. Also tweak
associated variable names to match the rename change.
Add an additional unique index spanning the system_id and profile_id
fields in the system_profiles table to ensure that a given profile
can only be associated with a given system once.
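
In migration terms that is likely just (a sketch; the index name is whatever Rails generates):

  add_index :system_profiles, %i[system_id profile_id], unique: true
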
  logger.debug("problematic profiles detected: #{incomplete.count} incomplete, #{invalid.count} invalid")
  response.headers['X-System-Profiles-Action'] = 'clear-cache'
end

Contributor commented:

  known_incomplete = Profile.identify_known_profiles(incomplete)

We should run the same identify_known_profiles check inside the announce call as well. There may be cases where a system sends two profiles with the same identifier but different profile types—one containing data and another without. In such situations, we still need to create the second profile and its corresponding system profile record.

@rtamalin (Collaborator Author) replied:

As I said in my response to @rjschwei, if two different profile types have the same identifier, that is a valid scenario, and won't cause a problem, because profiles are uniquely identified by the combination of (profile_type, identifier), not just identifier.

And by definition an announce_system requires complete profiles, i.e. ones that have both identifier and data, because the identifier-only optimization is only supported for update requests; the only reason that a client should send up an incomplete profile is when it believes that it has previously sent up the complete profile with that identifier and, because the profile hasn't changed, it can therefore send up the optimized incomplete profile next time... But an announce_system is part of an initial system registration, and a client cannot at that point assume that it has sent up anything previously; it should always send up complete profiles.

The only likely time, in our current model of operation, that two different profile_types can validly have the same identifier value and same data content would be for the "empty" data blob. And I would prefer not to have to implement an unnecessarily complex mechanism to avoid storing an empty data field for multiple records in the profiles table. The DB storage cost of one extra record per profile type, to store that profile type's version of the empty data blob, is minimal vs the complexity of trying to avoid storing that empty record. And under any other circumstances, if two different profile types have the same identifier, it would be an "unsafe" assumption to assume that their data blobs are in fact the same, given that the content formats of the different profile type data blobs are very different, e.g. output lines from lspci, vs a list of kernel modules, vs a list of packages and associated versions... If by some fluke occurrence we receive two different profile types with the same identifier for "non-empty" data blobs, we should be treating them as different data blobs.

Going one step further, if we ever decide to use a different mechanism for generating ids for new profile types in the future, then assuming that two different profile types had the same data blob because their identifiers match would be invalid. The current model of operation supports this future possibility without needing to change anything.

So to summarize, a profile is identified by the combination of its profile_type and identifier, and we should consider each profile_type as an independent scope; the existence of the same identifier value in multiple scopes is valid, and should not be taken to have any special meaning. This approach may result in some very minor extra DB storage usage to store independent versions of the "empty" data blob in each profile_type's scope, but that seems to me to be a minor cost vs the complexity and performance impact of the code needed to avoid this minor overhead.

Especially given that it is theoretically possible (though extremely improbable for the data blob sizes we are dealing with) for two different profile_types with non-empty, unequal data blobs to have the same identifier value.

validates :identifier, presence: true
validates :data, presence: true

def self.filter_profiles(profiles)
Contributor commented:

The function currently considers profiles as complete if the identifier and data keys are present, even when the data value is empty. It might be more accurate to treat profiles with empty data as incomplete.

@rtamalin (Collaborator Author) replied:

An empty data value is still a valid possibility, e.g. on an Azure VM we have seen that the lspci output is empty, meaning that an empty data blob is a valid value to report for a pci_data profile type.

As such we consider empty data blobs as validly reportable values, and only consider a profile as incomplete if it doesn't contain the data entry.

Contributor replied:

> An empty data value is still a valid possibility, e.g. on an Azure VM we have seen that the lspci output is empty, meaning that an empty data blob is a valid value to report for a pci_data profile type.

How is this possible??

My perspective:
An empty data value should ideally never occur, since identifiers are hashed from the data value.
This situation can only arise from an incorrect implementation in SuseConnect or a client-side issue (i.e., negative scenarios). In the current implementation, the impact is that empty values may be stored for incorrectly reported profiles. This should not happen. We should treat such cases as invalid or incomplete, certainly not as complete. Invalid makes more sense here to me.

Member replied:

Hmmm, we'll let you @paragjain0910 argue that with the people at Microsoft as to why hyper-v does not expose anything to the kernel that will be listed with lspci. Maybe @olafhering has an idea why it is not possible for lspci to report data for some instances in Azure.

Member replied:

We can also ask the question to Microsoft directly @brett060102 do you remember which VM size had this behavior?

Member replied:

What Azure instance type is that? An ordinary Gen2 VM on Windows Server needs no physical or emulated PCI, the IO devices are exposed via the vmbus. A Gen1 VM will likely have a few emulated PCI devices.
In Azure a VM with accelerated networking should have the Mellanox card on the PCI bus. Newer v6 instances may use the mana driver, which may or may not be a PCI device (I do not have an instance running to verify how a mana interface is exposed).

@rtamalin (Collaborator Author) replied:

As said, in the case of PCI Data on Azure VMs, an empty value is valid; the identifier is just the generated hash for the associated "empty" report for the profile type, which in the case of PCI Data could be "", or for something else could be the JSON representation of an empty list/array ([]) or object ({}).
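
For illustration, assuming the identifier is a SHA-256 hex digest of the data blob (the actual hashing scheme is defined on the client side):

  require 'digest'

  # Even an empty pci_data report hashes to a well-defined identifier
  Digest::SHA256.hexdigest('')
  # => "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"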

@rtamalin (Collaborator Author) replied:

@olafhering Hopefully @brett060102 can clarify with specifics, but from what I can remember they were relatively small & minimal instance types that had no output from lspci, and for beefier instance types there was limited output for a small number of devices, but nowhere near what would be seen for comparable instance types in AWS.
We surmised that it was due to the underlying system devices being presented via a different bus type, as you have confirmed.

Member replied:

> As said, in the case of PCI Data on Azure VMs, an empty value is valid; the identifier is just the generated hash for the associated "empty" report for the profile type, which in the case of PCI Data could be "", or for something else could be the JSON representation of an empty list/array ([]) or object ({}).

Well, yes, but @paragjain0910 was asking "How is this possible??". I think he deserves a little more of an answer than "We observed that lspci is empty on some Azure instances." Anyway, so that's what vmbus does, it handles the device information. Thanks @olafhering.

@paragjain0910 does that address your concern about empty PCI data and how that is possible?

We should consider profiles that have an empty identifier value as
invalid, so update the filter_profiles() method to check for and
treat them as invalid.
Update the System.complete_profiles=() method to avoid deletion and
recreation of linking records for profile associations that haven't
changed.
@ngetahun ngetahun added the 2 reviewers A second reviewer is requested. label Nov 17, 2025
@ngetahun ngetahun self-assigned this Nov 17, 2025
@rtamalin (Collaborator Author) commented Nov 17, 2025

@ngetahun We don't plan to merge this until after the 2.24 release goes out... If there is some sort of label pattern that should be used to indicate that, I'm happy to use it. Or should I just go ahead and add a Post-v2.24 label?

Never mind, I see the 2.25 label, so I added that...

@rtamalin rtamalin added the 2.25 label Nov 17, 2025
Enhance the SystemSerializer to take an optional serialized_profiles
set as an initialize() argument, defaulting to a new empty set if not
specified, and set up a serializer instance variable holding it.

This serialized_profiles instance variable set tracks profile ids
and is used to determine whether the serializer has previously
serialized a specific profile, with the first serialization including
the data field and subsequent serializations dropping it.

Update the send_bulk_system_update() request generation to set up a
new serialized_profiles set for each batch of systems being processed,
ensuring that only the first occurrence of a given profile includes
the data field.

Update tests to exercise the new SystemSerializer initialization and
optional system profiles data field inclusion, and verify that the
expected profiles are serialized by send_bulk_system_update().
Improve the debug messages logged by the announce_system and update
request handlers to report the profile types for problematic profiles
identified.

Additionally enhance the Profile.filter_profiles() method to return
hashes with symbolized keys to simplify determining which incomplete
profiles are unknown.

Only update the profiles associated with a system if valid complete
profiles were either provided or identified from incomplete profiles,
and add a test to ensure that existing profile associations are not
replaced if no valid complete profiles were provided in the update.
@rtamalin (Collaborator Author) commented:

@paragjain0910 @felixsch @mssola I've updated the PR with the implementation of the optimized serialization of systems in the send_bulk_system_update() request payload. I identified relatively minimal changes to the existing problematic test case implementation to get it passing again.

I also spotted and fixed a minor error in the update request handler that could wipe existing profile associations if profiles were provided to the update, but all were either unknown incomplete or invalid profiles. An additional test case has been added to cover this scenario.

@rtamalin rtamalin marked this pull request as draft November 19, 2025 22:25
@rtamalin (Collaborator Author) commented Nov 19, 2025

Found an issue when stress testing under heavy load, have identified a promising fix that I will try out tomorrow.
