fix(processor/github_metadata): match repos by full owner/name path #150
This fixes a bug in the repo matching logic of the metadata fetcher.
Before, I was matching GraphQL response data to the repo extracted from the URL using only the repo name. Problem is, if two repos share a name, both match and one overwrites the other, giving us bad data. That's what caused the bug in awesome-selfhosted/awesome-selfhosted-data#1828.
Now it matches on the owner/repo combo, which is unique, so we shouldn't see this issue again. Downside is that if a repo gets renamed, matching will break for it, but that should get auto-reported via L336.
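To illustrate the difference, here is a minimal sketch of both matching strategies. It is not the code from this PR: the helper names and the sample nodes are hypothetical, and the response is reduced to a flat list carrying only `nameWithOwner` plus a placeholder star count.

```python
from urllib.parse import urlparse


def owner_repo_from_url(url):
    """Return 'owner/name' (lowercased) from a GitHub repository URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"not a repository URL: {url}")
    return f"{parts[0]}/{parts[1]}".lower()


# Hypothetical, simplified stand-in for the GraphQL response nodes.
# The star counts are placeholders, not real data.
response_nodes = [
    {"nameWithOwner": "bitwarden/server", "stargazerCount": 1},
    {"nameWithOwner": "nextcloud/server", "stargazerCount": 2},
]


def match_by_name(url, nodes):
    """Old behaviour (as described above): compare the bare repo name only."""
    name = owner_repo_from_url(url).split("/")[1]
    return [n for n in nodes if n["nameWithOwner"].split("/")[1].lower() == name]


def match_by_full_path(url, nodes):
    """New behaviour: compare the full owner/name path."""
    key = owner_repo_from_url(url)
    return [n for n in nodes if n["nameWithOwner"].lower() == key]


url = "https://github.com/bitwarden/server"
print(len(match_by_name(url, response_nodes)))       # both 'server' repos match
print(len(match_by_full_path(url, response_nodes)))  # only bitwarden/server matches
```

The comparison is lowercased on both sides because GitHub treats owner and repo names case-insensitively, while the URLs in the dataset may use a different casing than the API returns.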
Before:
After:
A quick scan with Python suggests that up to 30 repos in our dataset currently have incorrect metadata.
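For reference, a scan like this can be sketched roughly as follows. This is not the exact script I ran: the function name and the `(project name, URL)` input shape are assumptions for illustration, and the sample entries are taken from the dataset.

```python
from collections import defaultdict
from urllib.parse import urlparse


def find_name_conflicts(projects):
    """Group GitHub projects by lowercase repo name.

    `projects` is an iterable of (project_name, url) pairs (assumed shape).
    Returns {repo_name: [(project_name, 'owner/repo'), ...]} for every
    repo name that is used by more than one owner.
    """
    by_name = defaultdict(list)
    for project_name, url in projects:
        parsed = urlparse(url)
        parts = [p for p in parsed.path.split("/") if p]
        # Skip anything that is not a github.com/<owner>/<repo> URL.
        if parsed.netloc.lower() != "github.com" or len(parts) < 2:
            continue
        owner, repo = parts[0], parts[1]
        by_name[repo.lower()].append((project_name, f"{owner}/{repo}"))

    conflicts = {}
    for repo_name, entries in by_name.items():
        owners = {full.split("/")[0].lower() for _, full in entries}
        if len(owners) > 1:
            conflicts[repo_name] = sorted(entries)
    return conflicts


sample = [
    ("Medusa", "https://github.com/pymedusa/Medusa"),
    ("MedusaJs", "https://github.com/medusajs/medusa"),
    ("Dovecot", "https://github.com/dovecot/core"),
]
for repo_name, entries in find_name_conflicts(sample).items():
    print(f"[REPO] Repository name: '{repo_name}'")
    for project_name, full in entries:
        print(f"  - {project_name}  Owner/Repo: {full}")
```

Repo names are lowercased before grouping, which is why `pymedusa/Medusa` and `medusajs/medusa` end up in the same group.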
Python Script Results
[REPO] Repository name: 'server'
Found 7 projects with 7 different owners:
- Bitwarden URL: https://github.com/bitwarden/server Owner/Repo: bitwarden/server
- Gotify URL: https://github.com/gotify/server Owner/Repo: gotify/server
- Nextcloud URL: https://github.com/nextcloud/server Owner/Repo: nextcloud/server
- PushBits URL: https://github.com/pushbits/server Owner/Repo: pushbits/server
- Screego URL: https://github.com/screego/server Owner/Repo: screego/server
- Sync-in URL: https://github.com/Sync-in/server Owner/Repo: Sync-in/server
- Traggo URL: https://github.com/traggo/server Owner/Repo: traggo/server

[REPO] Repository name: 'core'
Found 4 projects with 4 different owners:
- Dovecot URL: https://github.com/dovecot/core Owner/Repo: dovecot/core
- Gibbon URL: https://github.com/GibbonEdu/core Owner/Repo: GibbonEdu/core
- Home Assistant URL: https://github.com/home-assistant/core Owner/Repo: home-assistant/core
- ownCloud URL: https://github.com/owncloud/core Owner/Repo: owncloud/core

[REPO] Repository name: 'app'
Found 4 projects with 4 different owners:
- LibreBooking URL: https://github.com/LibreBooking/app Owner/Repo: LibreBooking/app
- Mybucks.online URL: https://github.com/mybucks-online/app Owner/Repo: mybucks-online/app
- SimpleLogin URL: https://github.com/simple-login/app Owner/Repo: simple-login/app
- Standard Notes URL: https://github.com/standardnotes/app Owner/Repo: standardnotes/app

[REPO] Repository name: 'self-hosted'
Found 3 projects with 3 different owners:
- Revolt URL: https://github.com/revoltchat/self-hosted Owner/Repo: revoltchat/self-hosted
- UI Bakery URL: https://github.com/uibakery/self-hosted Owner/Repo: uibakery/self-hosted
- Webtor URL: https://github.com/webtor-io/self-hosted Owner/Repo: webtor-io/self-hosted

[REPO] Repository name: 'homepage'
Found 2 projects with 2 different owners:
- Homepage by gethomepage URL: https://github.com/gethomepage/homepage Owner/Repo: gethomepage/homepage
- Homepage by tomershvueli URL: https://github.com/tomershvueli/homepage Owner/Repo: tomershvueli/homepage

[REPO] Repository name: 'homer'
Found 2 projects with 2 different owners:
- Homer URL: https://github.com/bastienwirtz/homer Owner/Repo: bastienwirtz/homer
- SIPCAPTURE Homer URL: https://github.com/sipcapture/homer Owner/Repo: sipcapture/homer

[REPO] Repository name: 'platform'
Found 2 projects with 2 different owners:
- Huly URL: https://github.com/hcengineering/platform Owner/Repo: hcengineering/platform
- Syncloud URL: https://github.com/syncloud/platform Owner/Repo: syncloud/platform

[REPO] Repository name: 'medusa'
Found 2 projects with 2 different owners:
- Medusa URL: https://github.com/pymedusa/Medusa Owner/Repo: pymedusa/Medusa
- MedusaJs URL: https://github.com/medusajs/medusa Owner/Repo: medusajs/medusa

[REPO] Repository name: 'panel'
Found 2 projects with 2 different owners:
- Pelican Panel URL: https://github.com/pelican-dev/panel Owner/Repo: pelican-dev/panel
- Pterodactyl URL: https://github.com/pterodactyl/panel Owner/Repo: pterodactyl/panel

[REPO] Repository name: 'analytics'
Found 2 projects with 2 different owners:
- Plausible Analytics URL: https://github.com/plausible/analytics/ Owner/Repo: plausible/analytics
- Prisme Analytics URL: https://github.com/prismelabs/analytics Owner/Repo: prismelabs/analytics

SUMMARY:
- Total conflicting repository names: 10
- Total projects potentially affected: 30

Just a heads up, even after merging this, it'll only partially fix things since we're currently only updating the last 3 months of the commit_history. We're already at 4 months now, so some data is still wrong. If you are fine with it, I would run the metadata fetcher manually once to fix this with 4 months (or honestly, since I'm doing it manually anyway, I could go ahead and do the full 12 months to have a complete dataset).