@Rabenherz112 (Contributor) commented Nov 29, 2025

This fixes a bug in the repo matching logic of the metadata fetcher.
Before, I was using only the repo name to match the GraphQL response data against the repo extracted from the URL. The problem is that if two repos share a name, both responses match the same software entry and one overwrites the other, leaving us with bad data. That's what caused the bug in awesome-selfhosted/awesome-selfhosted-data#1828.

Now it matches on the owner/repo combination, which is unique, so we shouldn't see this issue again. The downside is that if a repo gets renamed, matching will break for it, but that should get auto-reported via L336.
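To illustrate the difference, here is a minimal sketch of both matching strategies. It is not the code from `github_metadata.py`: the function names, the `name` and `source_code_url` keys, and the shape of `responses` are assumptions made for this example. Keys are lowercased on the assumption that GitHub treats owner and repo names case-insensitively.

```python
from urllib.parse import urlparse


def repo_key(url: str) -> str:
    """Return a lowercased 'owner/repo' key for a GitHub repository URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"not a repository URL: {url}")
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[:-4]
    return f"{owner}/{repo}".lower()


def match_by_name(entries, responses):
    """Old approach (hypothetical reconstruction): key on the bare repo name.

    Two entries whose repos share a name land on the same dictionary key,
    so the later one silently replaces the earlier one.
    """
    by_name = {repo_key(e["source_code_url"]).split("/")[1]: e["name"] for e in entries}
    return {r["url"]: by_name.get(repo_key(r["url"]).split("/")[1]) for r in responses}


def match_by_owner_repo(entries, responses):
    """New approach: key on owner/repo, which is unique per repository."""
    by_key = {repo_key(e["source_code_url"]): e["name"] for e in entries}
    return {r["url"]: by_key.get(repo_key(r["url"])) for r in responses}


if __name__ == "__main__":
    entries = [
        {"name": "Homepage by gethomepage",
         "source_code_url": "https://github.com/gethomepage/homepage"},
        {"name": "Homepage by tomershvueli",
         "source_code_url": "https://github.com/tomershvueli/homepage"},
    ]
    # Stand-in for the GraphQL response: one record per repository URL.
    responses = [{"url": e["source_code_url"]} for e in entries]
    print(match_by_name(entries, responses))
    print(match_by_owner_repo(entries, responses))
```

With name-only keys, both URLs resolve to a single software entry. With owner/repo keys, each URL resolves to its own entry. A renamed repo would return `None` from the lookup in this sketch, which is the case that needs reporting.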

Before:

DEBUG:github_metadata.py: Matched repository https://github.com/gethomepage/homepage to software entry: Homepage by gethomepage
DEBUG:github_metadata.py: Software "Homepage by gethomepage" current metadata - stargazers: 320, updated_at: 2021-05-09, archived: False
DEBUG:github_metadata.py: Software "Homepage by gethomepage" final metadata to be written:
DEBUG:github_metadata.py: Successfully wrote YAML file for software "Homepage by gethomepage"
DEBUG:github_metadata.py: Matched repository https://github.com/tomershvueli/homepage to software entry: Homepage by gethomepage
DEBUG:github_metadata.py: Software "Homepage by gethomepage" current metadata - stargazers: 26977, updated_at: 2025-11-25, archived: False
DEBUG:github_metadata.py: Software "Homepage by gethomepage" final metadata to be written:
DEBUG:github_metadata.py: Successfully wrote YAML file for software "Homepage by gethomepage"

After:

DEBUG:github_metadata.py: Matched repository https://github.com/gethomepage/homepage to software entry: Homepage by gethomepage
DEBUG:github_metadata.py: Software "Homepage by gethomepage" current metadata - stargazers: 320, updated_at: 2021-05-09, archived: False
DEBUG:github_metadata.py: Software "Homepage by gethomepage" final metadata to be written:
DEBUG:github_metadata.py: Successfully wrote YAML file for software "Homepage by gethomepage"
DEBUG:github_metadata.py: Matched repository https://github.com/tomershvueli/homepage to software entry: Homepage by tomershvueli
DEBUG:github_metadata.py: Software "Homepage by tomershvueli" current metadata - stargazers: 320, updated_at: 2021-05-09, archived: False
DEBUG:github_metadata.py: Software "Homepage by tomershvueli" final metadata to be written:
DEBUG:github_metadata.py: Successfully wrote YAML file for software "Homepage by tomershvueli"

A quick scan with Python shows that probably 30 repos in our dataset currently have incorrect metadata.

Python Script Results
[REPO] Repository name: 'server'
   Found 7 projects with 7 different owners:
   - Bitwarden
     URL: https://github.com/bitwarden/server
     Owner/Repo: bitwarden/server
   - Gotify
     URL: https://github.com/gotify/server
     Owner/Repo: gotify/server
   - Nextcloud
     URL: https://github.com/nextcloud/server
     Owner/Repo: nextcloud/server
   - PushBits
     URL: https://github.com/pushbits/server
     Owner/Repo: pushbits/server
   - Screego
     URL: https://github.com/screego/server
     Owner/Repo: screego/server
   - Sync-in
     URL: https://github.com/Sync-in/server
     Owner/Repo: Sync-in/server
   - Traggo
     URL: https://github.com/traggo/server
     Owner/Repo: traggo/server

[REPO] Repository name: 'core'
   Found 4 projects with 4 different owners:
   - Dovecot
     URL: https://github.com/dovecot/core
     Owner/Repo: dovecot/core
   - Gibbon
     URL: https://github.com/GibbonEdu/core
     Owner/Repo: GibbonEdu/core
   - Home Assistant
     URL: https://github.com/home-assistant/core
     Owner/Repo: home-assistant/core
   - ownCloud
     URL: https://github.com/owncloud/core
     Owner/Repo: owncloud/core

[REPO] Repository name: 'app'
   Found 4 projects with 4 different owners:
   - LibreBooking
     URL: https://github.com/LibreBooking/app
     Owner/Repo: LibreBooking/app
   - Mybucks.online
     URL: https://github.com/mybucks-online/app
     Owner/Repo: mybucks-online/app
   - SimpleLogin
     URL: https://github.com/simple-login/app
     Owner/Repo: simple-login/app
   - Standard Notes
     URL: https://github.com/standardnotes/app
     Owner/Repo: standardnotes/app

[REPO] Repository name: 'self-hosted'
   Found 3 projects with 3 different owners:
   - Revolt
     URL: https://github.com/revoltchat/self-hosted
     Owner/Repo: revoltchat/self-hosted
   - UI Bakery
     URL: https://github.com/uibakery/self-hosted
     Owner/Repo: uibakery/self-hosted
   - Webtor
     URL: https://github.com/webtor-io/self-hosted
     Owner/Repo: webtor-io/self-hosted

[REPO] Repository name: 'homepage'
   Found 2 projects with 2 different owners:
   - Homepage by gethomepage
     URL: https://github.com/gethomepage/homepage
     Owner/Repo: gethomepage/homepage
   - Homepage by tomershvueli
     URL: https://github.com/tomershvueli/homepage
     Owner/Repo: tomershvueli/homepage

[REPO] Repository name: 'homer'
   Found 2 projects with 2 different owners:
   - Homer
     URL: https://github.com/bastienwirtz/homer
     Owner/Repo: bastienwirtz/homer
   - SIPCAPTURE Homer
     URL: https://github.com/sipcapture/homer
     Owner/Repo: sipcapture/homer

[REPO] Repository name: 'platform'
   Found 2 projects with 2 different owners:
   - Huly
     URL: https://github.com/hcengineering/platform
     Owner/Repo: hcengineering/platform
   - Syncloud
     URL: https://github.com/syncloud/platform
     Owner/Repo: syncloud/platform

[REPO] Repository name: 'medusa'
   Found 2 projects with 2 different owners:
   - Medusa
     URL: https://github.com/pymedusa/Medusa
     Owner/Repo: pymedusa/Medusa
   - MedusaJs
     URL: https://github.com/medusajs/medusa
     Owner/Repo: medusajs/medusa

[REPO] Repository name: 'panel'
   Found 2 projects with 2 different owners:
   - Pelican Panel
     URL: https://github.com/pelican-dev/panel
     Owner/Repo: pelican-dev/panel
   - Pterodactyl
     URL: https://github.com/pterodactyl/panel
     Owner/Repo: pterodactyl/panel

[REPO] Repository name: 'analytics'
   Found 2 projects with 2 different owners:
   - Plausible Analytics
     URL: https://github.com/plausible/analytics/
     Owner/Repo: plausible/analytics
   - Prisme Analytics
     URL: https://github.com/prismelabs/analytics
     Owner/Repo: prismelabs/analytics

================================================================================
SUMMARY:
  Total conflicting repository names: 10
  Total projects potentially affected: 30
================================================================================
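For reference, a scan like the one above could be sketched as follows. This is not the script that produced the output: the function name and the `(name, url)` input format are assumptions, and loading the project list from the dataset is left out. Repo names are compared case-insensitively, which is consistent with `pymedusa/Medusa` and `medusajs/medusa` being grouped together above.

```python
from collections import defaultdict
from urllib.parse import urlparse


def find_name_conflicts(projects):
    """Group (name, url) pairs by lowercased repo name.

    Returns only the repo names used by more than one distinct owner/repo.
    Non-GitHub URLs and URLs without an owner/repo path are skipped.
    """
    groups = defaultdict(dict)
    for name, url in projects:
        parsed = urlparse(url)
        parts = [p for p in parsed.path.split("/") if p]
        if parsed.netloc.lower() != "github.com" or len(parts) < 2:
            continue
        owner, repo = parts[0], parts[1]
        # Keyed by lowercased owner/repo so duplicate listings count once.
        groups[repo.lower()][f"{owner}/{repo}".lower()] = (name, url)
    return {repo: found for repo, found in groups.items() if len(found) > 1}


if __name__ == "__main__":
    projects = [
        ("Medusa", "https://github.com/pymedusa/Medusa"),
        ("MedusaJs", "https://github.com/medusajs/medusa"),
        ("Homer", "https://github.com/bastienwirtz/homer"),
    ]
    for repo, found in sorted(find_name_conflicts(projects).items()):
        print(f"[REPO] Repository name: '{repo}'")
        for name, url in found.values():
            print(f"   - {name}\n     URL: {url}")
```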

Just a heads up: even after merging this, it'll only partially fix things, since we're currently only updating the last 3 months of the commit_history. We're already at 4 months now, so some data is still wrong. If you're fine with it, I would run the metadata fetcher manually once with 4 months to fix this (or honestly, since I'm doing it manually anyway, I could go ahead and do the full 12 months to get a complete dataset).

@Rabenherz112 (Contributor, Author)

The test failures don't seem to come from my change.

@nodiscc nodiscc self-assigned this Nov 29, 2025
@nodiscc nodiscc self-requested a review November 29, 2025 19:50
@nodiscc nodiscc added the bug Something isn't working label Nov 29, 2025
@nodiscc (Owner) commented Nov 29, 2025

Yeah, it's "that" test again. Fixed in e9d71ae, can you please rebase on master?

@nodiscc nodiscc force-pushed the fix-github-metadata-url-matcher branch from f86a5b8 to fd665e7 Compare November 29, 2025 20:02
@nodiscc (Owner) commented Nov 29, 2025

Never mind, GitHub allows rebasing from the web UI now 👍

@nodiscc (Owner) left a comment

Thanks

@nodiscc nodiscc merged commit cec628d into nodiscc:master Nov 29, 2025
1 check passed
