Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PICARD-2655: Improve matching for VA releases #1918
base: master
Are you sure you want to change the base?
PICARD-2655: Improve matching for VA releases #1918
Changes from all commits
86fee33
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am entirely unsure that this logic is at all correct for various reasons (though it may equally be that I don't understand how this is supposed to work) but:
I think that we should be considering a situation where we give the artist similarity a value of 1 ONLY if the album artist is "various" or "various artists" AND the release we are matching has a VA_ID. As the code stands if the album artist is "The Beatles" and the release we are matching is VA, artist similarity is 1. Or are we going to assume that any release which is VA is effectively going to match any album artist regardless of whether it is "various"?
What happens if the album is not in english e.g. the album is a portuguese language album and the album artist is "vários artistas"? I think we need a much longer list of VA textual names which covers all languages that we support. (Edit: Not sure how text comparisons work where there are accents - but we might want not only to make a lower case comparison as in the current code, but also to remove accents.)
I am unclear why we are doubling the weighting for artist if the album is "various" rather than halving it. In the specific use case described in the PR description, if the album artist is "various" and it is failing to match a release because the release artist is the label rather than "various" then surely we should give less weight to the artists not matching rather than more weight?
Do we need some unit tests with various albums and various potential releases to test that we get the best release selected?
Edited: to add note about accented characters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love have tests for that... thanks for volunteering ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am happy to volunteer to write tests for this providing that you are prepared to wait until I have the time to spend on open source and for this to get to the top of my queue of open source stuff I have planned. 😉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Sophist-UK that we likely should combine the tests of tags indicating VA and the release being VA.
The goal of the patch is to improve the matching of files from a VA release to VA. And one of the problems here is that pure matching of album artist tag with the MB artist name ("Various Artists") often does not yield good results.
The code will indeed improve the situation. But at the same time it will make matching a VA release from MB to any file more likely, as all VA releases will always have full albumartist similarity.That might increase the problem that files get tagged too much against compilation, something many users already struggle with.
Combining the check of existing album artist tags indicating VA and the release being VA would avoid this. For the localization I would suggest to check for
a.lower() in {'various', 'various artists', config.setting['va_name'].lower()}
.This will not solve the case where the album artist is set to the label name or something completely different, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@phw
A. I am really not sure that we should use
config.setting['va_name']
as this is stated in the documentation to be the output name when you use a VA_ID release - not the least for file naming so that the user can group together albums) - so this is (usually) the users local idiom for VA. But (clearly) a user can be trying to match a VA album from a different country, so Picard needs to recognise these. This is why I suggested a list (constant -> one off effort) of V/VA for every locale supported. And this same list could then be used by MB for helping improve the data, so the same effort could then also be reused. (P.S. It won't be perfect as a first iteration, but Google / Deepl translate can give you a first cut for translations of VA without too much effort. If it will help I can find the time do this and post the results here for reuse.)B. There are probably four use cases to consider, two where the album artist is set to the label name:
Of course the principle cause of all three use cases is bad data - so the primary fix should be to have an MB report to identify use cases 1, 3 and 4 and have editors fix the data. (Or indeed, have some code for case 3 to auto-create (voteable) edits to fix the album artist MBID.) However, as per this PR Picard should still try to handle bad data because it will never be perfect - and the code should probably deal with cases 1 and 3 and perhaps also 4 (since it is only tweaking the fuzzy match and not making hard changes to the data).
C. That said, we should also consider whether we want Picard to change the data if this release is chosen. If we genuinely believe that use cases 1, 2 and 4 are detectable and we are tweaking the fuzzy search, should we also tweak the MB data to "improve" it? (This is much more contentious, and I put this up for discussion because I have not yet come to an ethical/practical personal conclusion.) If we do this, we would have to do it elsewhere in the code because once the Album has an Album MBID, fuzzy search code is not executed.
D. I think that fuzzy match fudges like this to attempt to correct bad MBID data should have an
options
switch to enable / disable them. And if we decide to fix the data too, that definitely should have a switch.