Skip to content

Conversation

@brenns10
Copy link
Member

@brenns10 brenns10 commented Jul 9, 2025

I apologize in advance for the size of this pull request, as well as the relative disorganization of the commits. My design for this changed one or twice during the implementation, and I didn't end up squashing those changes down. The result is that the branch shows the evolution of the design, which is reasonable, but I wouldn't necessarily recommend reviewing individual commits, except to read the messages with rationales.

This adopts the module API and moves our minimum drgn version to 0.0.32 (despite what the branch name says). Since this is the latest, and the last version supporting Python 3.6, we will likely be using this as our minimum for quite a long time. As such, I didn't see any reason to keep support for any older version.

The goals here were:

  1. Remove code that reimplemented parts of drgn's debug info finding logic.
  2. Make Corelens more reliable for detecting module debuginfo.
  3. Get rid of the bespoke debuginfo loading mechanisms for corelens and drgn_tools.cli, instead relying on drgn debug info finders, and making their behavior more consistent.
  4. With the power of debuginfo finders & plugins, add the ability to download and extract OL debuginfo RPMs to all commands, including the drgn command.
  5. Remove old, dead code that supported older drgn versions. EG: code upstreamed to drgn.

This adds a drgn plugin, with several debug info finders. It revamps the (hardly ever used) drgn-tools config file. It does modify some elements of the corelens & drgn_tools.cli arguments. It removes the KernelModule class. And there are several other major changes. I've written a document that goes into far more detail than I feel comfortable writing in a PR description. I'll share it to those curious.

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Jul 9, 2025
brenns10 added 12 commits July 9, 2025 16:44
Drgn 0.0.31 brings huge improvements, especially with the module API and
plugins. It also is the last release to officially support Python 3.6
upstream. As a result, it should serve as an excellent long-term minimum
requirement, allowing us to assume these features are available and drop
support code for older versions.

Signed-off-by: Stephen Brennan <[email protected]>
Now that drgn 0.0.31 is our minimum, we can assume these helpers are
provided by drgn. Drop their implementations and import them directly
from drgn.

Signed-off-by: Stephen Brennan <[email protected]>
This helper was upstreamed into d_path(). The sb_first_mount_point()
function needs to remain, because it is used by the fsnotify helpers.

Signed-off-by: Stephen Brennan <[email protected]>
The frame_name() implementation uses KernelModule to find the module
corresponding to an address in order to label the stack frame with the
module. This can now be done more efficiently with drgn's module API.

frame_name() also implements fallingback logic for finding the frame
name. When DWARF is not present, it falls back to a symbol, and if
that's not found, it falls back to an address, or worst case, the
hexadecimal code address. But since drgn 0.0.31, this exact fallback
sequince is now built-in to StackFrame.name.

The only difference is that our version doesn't require that the user
has first loaded the "module_kallsyms" symbol finder, whereas drgn's
does, because it only uses the symbol finder API to lookup symbols. As a
result, we'll need to make sure that all callers have loaded it. The
major ones will be (a) corelens, (b) drgn_tools.cli, and (c) any caller
who directly uses "drgn" to execute a script. (a) and (b) can be solved
easily in drgn-tools, but (c) will require a new drgn plugin to handle
proper debuginfo handling.

Signed-off-by: Stephen Brennan <[email protected]>
Drgn 0.0.31 introduces the module API, and every kernel module will have
an automatically-detected drgn.Module associated with it by drgn. This
Module is the key to seeing information about whether debuginfo is
loaded, and it also has efficient lookup APIs. Remove duplicate and
unused functionality drgn_tools.module and rely on drgn.Module for what
we need.

Some functionality, like get_symbol(), is now totally unused. That code
also relates to the ModuleSymbolFinder, which is similarly unused, and
even if it were, it could be replaced by the SymbolIndex which I've
upstreamed to drgn. The only symbol-releated functionality which still
belongs is module_exports() / KernelModule.exports().

Signed-off-by: Stephen Brennan <[email protected]>
Drgn's logging, in general, is quite useful, and in the future we may
include a way to control the log level used by corelens & drgn. However,
drgn logs a warning for missing debug files when you use the
load_default_debug_info() function, which ruins the corelens output. We
really don't want this, because corelens already handles module
debuginfo. Add a filter to the drgn logger in order to handle this.

While we're at it, we can simplify our logging in Corelens by quite a
bit, by using the logging module instead of the existing bespoke system.
Thankfully, logging has all the handler components that are necessary to
easily reproduce the same behavior.

Signed-off-by: Stephen Brennan <[email protected]>
In the happy case, every kernel module which is not OOT-tainted will be
found in a kernel RPM. This is true in the overwhelming majority of
cases, but a few edge cases can cause undesirable behavior.

1. Ksplice cold-patch modules are expected to be mismatched. The
shipped debuginfo doesn't apply for them.
2. Some old kernels had module build ID mismatches which cannot be
corrected.

In either case, drgn will reject the debuginfo and continue calling
debuginfo finders. This can result in our finders being called multiple
times, pointlessly downloading and/or extracting RPMs multiple times.

To handle cold-patches, we just don't consider them in-tree modules. To
handle the remaining build ID mismatches, we record all module names for
which we've attempted RPM extraction. If we're later called, we check
each module against that set and avoid re-doing the work.

Signed-off-by: Stephen Brennan <[email protected]>
Refactor the finders into methods of a single class. This allows sharing
important cached info.

Establish consistent arguments for corelens and drgn_tools.cli. The
config file can only configure drgn -- corelens and drgn_tools.cli have
their own set behaviors. While we're at it, update the corelens manpage.

Signed-off-by: Stephen Brennan <[email protected]>
Debug info searches go in two steps: (1) kernel, (2) modules. You cannot
know which modules must be loaded until the kernel is loaded. This is a
problem for RPM downloading and extraction. While we can't avoid the
double-extraction, it is pretty wasteful to download the RPM once for
each step.

The rpm_cache configuration was originally conceived as a band-aid over
this issue. It lets the RPM get saved persistently, but it has the nice
upside that the ol-local-rpm would find the saved RPM and ensure that
the ol-download finder doesn't get called again. That's nice, but in
reality we probably don't want rpm_cache enabled all that often.
Instead, persist the downloaded file until the program exits. Ensure
that even if ol-local-rpm is disabled, the ol-download finder will still
call it with the cached RPM as necessary.

Signed-off-by: Stephen Brennan <[email protected]>
Stop using KernelModule, add support for ksplice cold-patched modules,
and add a verbose printing mode that shows all the modules.

Signed-off-by: Stephen Brennan <[email protected]>
brenns10 added 11 commits July 14, 2025 09:57
Now that we've migrated to the Module API, the KernelModule was nothing
but a less efficient wrapper over the Module / struct module objects.
Let's not keep it around.

Signed-off-by: Stephen Brennan <[email protected]>
This is both unused, and can trivially be provided by the
Module.build_id attribute.

Signed-off-by: Stephen Brennan <[email protected]>
The "module" module is much smaller now! But some of the remaining
members are no longer included in __all__. Fix that.

Signed-off-by: Stephen Brennan <[email protected]>
Drgn 0.0.32 was unexpectedly released with some bug fixes and
improvements. It is still Python 3.6 compatible. We will adopt it as the
new minimum version going forward.

Signed-off-by: Stephen Brennan <[email protected]>
This reverts commit 1e20fbe.

Now that our minimum drgn version is 0.0.32, we don't need the fix for
skipping cursors.

Signed-off-by: Stephen Brennan <[email protected]>
Using found_count == max_pages as the loop break condition can fail, if
we hit the "inode_path(inode) is None" case on line 238 when found_count
== max_pages. Use a >= comparison so that if we've overshot, we still
break out of the loop.

This fixes uncommon issues where the test framework spends a long time
dumping pages.

Signed-off-by: Stephen Brennan <[email protected]>
The "maxitems" parameter set at 10,000 is tracked per mount, which means
that the list_lru test can output far more than that. Drop this down to
100 per mount, only allow the test to process 20 mounts before
completing.

The purpose of the smoke test is just to verify that we have iterated
over the data structures correctly. Having small limits is fine for that
purpose. Custom (or zero) amounts can still be specified for those
interested in running the full operation.

Signed-off-by: Stephen Brennan <[email protected]>
Signed-off-by: Stephen Brennan <[email protected]>
@brenns10 brenns10 force-pushed the drgn_0.0.31_minimum branch 2 times, most recently from ea697b9 to b6c29d7 Compare July 14, 2025 20:28
For objects with file size less than 100 MiB, we use a direct PUT
request to send the data, rather than a multipart upload. However,
rather than passing the block of data that we already read from the file
(first_chunk), we are reading from the now exhausted file, so we always
upload empty data. Fix this so that we get correct uploads.

Signed-off-by: Stephen Brennan <[email protected]>
@brenns10 brenns10 force-pushed the drgn_0.0.31_minimum branch from b6c29d7 to deb8183 Compare July 14, 2025 20:43
@brenns10
Copy link
Member Author

Ok, finally got all tests passing internal & external. I think this is ready for review.

@brenns10 brenns10 requested a review from biger410 July 14, 2025 21:06
@brenns10
Copy link
Member Author

Hey @Asphaltt - you made some changes to debuginfo lookup paths in #166. This is the big refactor I mentioned. I know it's way too large to ask you to read through, but the core change is that now, we'll rely on drgn's built-in debuginfo finder, and we just add on our own logic for anything Oracle specific. So ideally, this should work for other distros with no extra effort. If you're able to test it out for your use cases, I'd love to hear your feedback. No worries if you can't though.

Copy link
Member

@biger410 biger410 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@brenns10 brenns10 merged commit e3080da into oracle-samples:main Jul 15, 2025
4 checks passed
@brenns10 brenns10 deleted the drgn_0.0.31_minimum branch July 15, 2025 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

OCA Verified All contributors have signed the Oracle Contributor Agreement.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants