Description
Title
The mystery of the remnant symbol: software, society, and our digital heritage
Describe your Talk
This talk is about my experience of contributing to an open-source tool called pydistcheck. This package is a linter that finds portability issues in Python distributions (sdists, wheels, and conda archives). It came into existence after its author's SciPy 2022 talk "Does that CSV Belong on PyPI? Probably Not", which directly inspired its creation. Specifically, I put together a pull request at jameslamb/pydistcheck#310, where I addressed an issue related to Apple, Inc.'s strip system utility when it is used during the build process of Python packages with compiled extensions on macOS machines.
The strip program on macOS has an interesting quirk: it adds a specific weak symbol, radr://5614542 (see Radar, Apple's legacy bug tracking system), to Mach-O files, and this symbol is detected as a debug symbol. It appears even in properly stripped binaries as an artifact of the process, whether strip is run manually or at the link step of a build system. At the time of writing, this behaviour does not exist in third-party linkers for macOS, such as zld or the later alternative lld. However, it is still known to exist in the newer variant of Apple's static linker, ld-prime. The source code for ld-prime has not been released yet.
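To make the false positive concrete, here is a minimal sketch of how a checker could ignore the artifact when scanning symbol listings. It is not pydistcheck's actual implementation. The function name, the sample lines, and the assumption that the symbol name is the last whitespace-separated field on each line are all hypothetical and chosen for illustration.

```python
# The symbol name comes from the description above; everything else is illustrative.
STRIP_ARTIFACT = "radr://5614542"


def symbols_excluding_strip_artifact(symbol_lines):
    """Return symbol names from nm-style lines, skipping the strip artifact.

    Assumption: the symbol name is the last whitespace-separated field
    on each line, which may not hold for every tool or flag combination.
    """
    names = []
    for line in symbol_lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        name = fields[-1]
        if name != STRIP_ARTIFACT:
            names.append(name)
    return names


# Hand-written sample lines, not captured from a real binary.
sample = [
    "0000000000000000 - 00 0000 SO /tmp/example.c",
    "0000000005614542 - 00 0000 FUN radr://5614542",
    "",
]
print(symbols_excluding_strip_artifact(sample))
```

With a filter like this, a binary whose only remaining "debug symbol" is the artifact would be treated as properly stripped.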
As a part of my investigation into this issue, I discovered that several pages from https://opensource.apple.com/ have been wiped into oblivion, especially pages for the original strip source code. The Internet Archive (at https://archive.org/ and its mirrors at https://archive.pw and https://archive.is) was of hardly any help, as Apple was able to take down pages from there as well. I struggled to find even one working link to Apple's cctools source, and I was able to grab one that had surprisingly not been taken down. Now that I have the original strip.c source code, I understand what the Radar bug is related to: the old classic linker misbehaves if an "indirect symbol" appears at the 0th index in the symbol table. Indirect symbols are like forwarding addresses; they point to other symbols and are used for things like functions in shared libraries. This was worked around by adding this dummy symbol at the 0th index when there was a risk of indirect symbols landing there. In my talk, I plan to describe this behaviour briefly, using code snippets from the source code along with my understanding of the issue and the insights I've gained. While this symbol has been proven harmless, it continues to appear sporadically in internet culture. Here's a search result from Twitter, for example: https://x.com/search?q=radr://5614542&src=typed_query&f=top.
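The workaround can be pictured with a toy model. The sketch below is not Apple's strip.c code and does not reflect the real Mach-O symbol table layout. The Symbol class, the guard_index_zero function, and the sample symbol names are hypothetical and only mirror the idea described above: if an indirect symbol would land at index 0, a dummy symbol is placed there instead.

```python
from dataclasses import dataclass

# Name of the dummy symbol, as described in the text above.
DUMMY_NAME = "radr://5614542"


@dataclass(frozen=True)
class Symbol:
    """Toy stand-in for a symbol table entry (not the real Mach-O layout)."""
    name: str
    indirect: bool = False


def guard_index_zero(symbols):
    """Return a new table in which index 0 never holds an indirect symbol."""
    if symbols and symbols[0].indirect:
        # Prepend a harmless, non-indirect dummy so that every
        # indirect symbol shifts to index 1 or later.
        return [Symbol(DUMMY_NAME)] + list(symbols)
    return list(symbols)


# Hypothetical table where an indirect symbol would sit at index 0.
table = [Symbol("_forwarded_printf", indirect=True), Symbol("_main")]
guarded = guard_index_zero(table)
print([s.name for s in guarded])
```

In this model the dummy entry is only added when it is needed, which matches the "when there was a risk" wording above; whether the real tool is equally selective is something the source code excerpts in the talk will show.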
At the time of this talk, we're facing a crisis in knowledge dissemination. This bug led me to think about how the documentation needed to understand fundamental computing infrastructure, whether for macOS or for other open-source code whose later versions are now proprietary, has just... disappeared from public view. It is not ideal for a programmer to have to act as a digital archaeologist, exploring bugs with a history spanning two decades or more in code we rely on every day, regardless of how intriguing that line of work may be. While this is the work of a big tech oligarch like Apple, Inc., I opine that technical gatekeeping of this form echoes a larger pattern in the 2025 political landscape that I want to share with the audience, as described below.
In 2025, POTUS Donald Trump and the far-right GOP have had America in shambles through their fascist agenda. Over 8,000 web pages and approximately 3,000 datasets have been removed or modified across federal agencies. NASA had to comply and undertook a comprehensive removal of DEIA-related content, including interviews with Black and female NASA employees and LGBTQ-related content. While this hasn't been primarily about the removal of scientific data, it illustrates that institutional knowledge and open data established over half a century or longer can vanish swiftly following administrative orders, with hardly any time for recourse.
I'd like to further pivot briefly into two case studies for preserving data and code:
- The Bitbucket Mercurial crisis in 2020: Bitbucket had supported Mercurial since 2008, but by 2020, it had announced the end of support and the eventual complete removal of all Mercurial repositories. Although Mercurial's usage had declined to less than 1% of new users, owing to the popularity of other version control systems, this ultimately affected over 250,000 existing repositories. The Software Heritage non-profit organisation was able to rescue most of them, if not all: https://octobus.net/blog/2020-08-05-bitbucket-public-archive.html.
- The Apollo 11 source code that powered humanity's first moon landing was nearly lost forever until amateur enthusiast Ron Burkey initiated the Virtual AGC Project in 2003 to transcribe code from 1960s hard copies held by the MIT Museum. The code now exists at https://www.ibiblio.org/apollo and https://github.com/chrislgarry/Apollo-11.
Lastly, I plan to end by describing how the audience can ensure that their code is archived and preserved: don't rely solely on GitHub and instead mirror to multiple platforms, use Software Heritage to archive (necessary) code, and think about broader themes of digital preservation and social justice. The main takeaways are that every technical problem has social dimensions and that preservation is a political issue; those who control technical knowledge will shape who can innovate. I hope the audience will find these steps for developers and citizens to preserve and democratise technical knowledge pragmatic.
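As one example of the archival step, the sketch below builds a request for Software Heritage's "Save Code Now" service. It assumes the public endpoint shape /api/1/origin/save/<visit type>/url/<origin URL>/ as I understand it from the Software Heritage API documentation; please verify this against the current documentation before relying on it. The helper names are hypothetical, and the repository URL is only an example.

```python
import urllib.request

# Assumed base URL of the public Software Heritage web API.
SWH_API = "https://archive.softwareheritage.org/api/1"


def save_code_now_url(origin_url, visit_type="git"):
    """Build the (assumed) Save Code Now endpoint for a repository URL."""
    return f"{SWH_API}/origin/save/{visit_type}/url/{origin_url}/"


def request_archival(origin_url, visit_type="git"):
    """Send the save request. This makes a network call and is not run here."""
    request = urllib.request.Request(
        save_code_now_url(origin_url, visit_type),
        data=b"",  # empty body; the endpoint is assumed to take none
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


# Only the URL is built here; no request is sent.
print(save_code_now_url("https://github.com/jameslamb/pydistcheck"))
```

Calling request_archival on a schedule, alongside pushing to a second hosting platform, is one way to avoid depending on a single provider.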
Pre-requisites & reading material
- Some information about distributing Python code to PyPI and/or other package indices (not required, but helpful to have); and
- An understanding of how to compile code into binaries and run them: what (Unix) object files are, what symbols are, how to use a compiler and linker; and
- A general interest in the politics and preservation of code and data
I do not expect the general audience to be familiar with these details, but having an overview of these would be beneficial. Given that the June edition is also targeted at a Linux-centric audience following a collaboration with ILUG-D, I expect the audience coming from the ILUG-D community and those acquainted with Linux/Unix concepts to be able to follow along with relative ease. I plan to cover these topics on a rudimentary basis instead of diving into deeper explanations, as the talk is thirty minutes long.
Resources
For more information on radr://5614542
- SYMBOLS STRIPPED False positive MobSF/Mobile-Security-Framework-MobSF#1917 (comment)
- [bug] 'compiled-objects-have-debug-symbols' false positive on mach-o files that have been passed through 'strip' jameslamb/pydistcheck#235
- https://stackoverflow.com/questions/52091210/why-osxs-strip-can-not-remove-weak-symbols
The political state of the U.S.A. in 2025
- https://en.wikipedia.org/wiki/2025_United_States_government_online_resource_removals
- https://web.archive.org/web/20250127173130/https://github.com/nasa/Transform-to-Open-Science/commit/bb7560bd1c35f6e2e200a1ad78f4c78d28ab282b
- https://profmattstrassler.com/2025/02/09/an-attack-on-us-universities/
- https://www.nature.com/articles/d41586-025-01547-5
- https://www.bbc.com/future/article/20250422-usa-scientists-race-to-save-climate-data-before-its-deleted-by-the-trump-administration
The Apollo 11 story
- https://www.americanscientist.org/article/moonshot-computing
- https://abcnews.go.com/Technology/apollo-11s-source-code-tons-easter-eggs-including/story?id=40515222
- https://github.com/chrislgarry/Apollo-11
Bitbucket and Mercurial, Software Heritage, digital preservation, and more
- https://www.atlassian.com/blog/bitbucket/sunsetting-mercurial-support-in-bitbucket
- https://community.atlassian.com/t5/Bitbucket-questions/Multiple-missing-repositories/qaq-p/1859166
- https://news.ycombinator.com/item?id=20745393
- https://www.softwareheritage.org/mission/
- https://www.archivematica.org/en/
- https://archive.org/details/19Roberto-Di-Cosmo
- https://en.wikipedia.org/wiki/Digital_preservation
- https://www.eosc-pillar.eu/use-cases/software-source-code-preservation-reference-access
Time required for the talk
Twenty-five minutes; five minutes for questions
Link to slides/demos
No response
About you
I am a software engineer at Quansight, where I work on open-source scientific software in the Scientific Python and PyData ecosystems. My interests include Python packaging, compilers and toolchains, documentation and technical writing, as well as numerical software, among other areas. I spend my time working on Pyodide, JupyterLite, and various other open-source scientific software projects, and I enjoy the capacities in which they all interoperate.
Bluesky: @agriyakhetarp.al
Mastodon: fosstodon.org/@agriyakhetarpal
Twitter: @agriyakhetarpal
LinkedIn: linkedin.com/in/agriyakhetarpal
Email address: agriyakhetarpal [at] outlook [dot] com
I previously spoke at a PyDelhi meetup in July 2024. Here's my submission from that time: #285
Availability
21/06/2025
Any comments
N/A