CD-Text cannot be parsed, raising ValueErrors #169
This is the code in question (image/toc.py):
|
This patch fixes it for me:
diff --git a/morituri/image/toc.py b/morituri/image/toc.py
index c83e940..7a5dab4 100644
--- a/morituri/image/toc.py
+++ b/morituri/image/toc.py
@@ -195,6 +195,15 @@ class TocFile(object, log.Loggable):
             if m:
                 key = m.group('key')
                 value = m.group('value')
+
+                # Occasionally, CDRDAO will contain CD-TEXT that ends with a
+                # slash. This will cause value.decode('string-escape') to fail
+                # with a ValueError. We could -catch- that exception, but it
+                # might be more clean to just strip the trailing slash, as that
+                # seems to be the main/only issue right now.
+                while value.endswith('\\'):
+                    value = value[:-1]
+
                 # usually, value is encoded with octal escapes and in latin-1
                 # FIXME: other encodings are possible, does cdrdao handle
                 # them ?
|
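For anyone who wants to reproduce the failure outside whipper, here is a minimal sketch of why a trailing backslash breaks escape decoding and what stripping it does. It is written for Python 3, where the `string-escape` codec from the patch no longer exists, so `unicode_escape` is used as a stand-in. The helper name `strip_trailing_backslashes` and the sample value are made up for illustration and are not part of whipper.

```python
def strip_trailing_backslashes(value):
    """Drop any run of backslashes at the end of a raw CD-Text value.

    Hypothetical helper mirroring the while loop in the patch.
    """
    return value.rstrip('\\')


# Hypothetical raw value that ends in a single backslash.
raw = b'Some Title\\'

try:
    raw.decode('unicode_escape')
    failed = False
except ValueError:  # UnicodeDecodeError is a subclass of ValueError
    failed = True

# After stripping, the same value decodes without error.
cleaned = strip_trailing_backslashes(raw.decode('latin-1'))
decoded = cleaned.encode('latin-1').decode('unicode_escape')
```

Note that this also removes a correctly escaped backslash at the end of a value, which is the same trade-off the patch makes.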
Maybe we can just merge this? Seems useful and also harmless. |
Maybe the double
https://www.gnu.org/software/libcdio/cd-text-format.html#Pack-Contents |
Found the explanation:
CD-Text can also use the Shift JIS double-byte encoding, which I think cdrdao supports (and which we're not handling in whipper). Mixing Shift JIS and ISO/IEC 8859-1 (Latin-1) is not allowed.
|
What's the purpose of the |
Since cdrdao 1.2.5 (2023-02-03), by default the .toc file encodes strings in UTF-8 and backslash-escaped octal sequences are no longer used. The old behavior can still be accessed by passing
Even if that bug is eventually fixed and whipper passed
# usually, value is encoded with octal escapes and in latin-1
# FIXME: other encodings are possible, does cdrdao handle
# them ?
value = value.encode().decode('unicode_escape')
The 'unicode_escape' codec works only for ASCII and Latin-1, and only because their code points map directly onto the same Unicode code points. But CD-Text can also be encoded in MS-JIS (Shift-JIS), and treating those byte values as Latin-1 would produce mojibake. |
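A small sketch of the behavior described above, in case it helps to see it concretely. `legacy_decode` is just a hypothetical wrapper around the quoted line. The sample strings are assumptions about what the .toc values might look like, in particular that an older cdrdao would write every Shift JIS byte as an octal escape.

```python
def legacy_decode(value):
    """The decoding quoted above: expand backslash escapes, bytes read as Latin-1."""
    return value.encode().decode('unicode_escape')


# Old-style output: Latin-1 bytes as octal escapes (Ø is 0xD8, octal 330).
latin1_ok = legacy_decode(r'M\330L')

# cdrdao 1.2.5+ style output: the value is already UTF-8 text, no escapes.
# Each UTF-8 byte is misread as one Latin-1 character.
utf8_garbled = legacy_decode('MØL')

# Assumed old-style MS-JIS output: each Shift JIS byte as an octal escape.
sjis_bytes = '日本'.encode('shift_jis')
escaped = ''.join('\\%03o' % byte for byte in sjis_bytes)
sjis_garbled = legacy_decode(escaped)

# The garbled text still carries the original bytes, so it can be recovered
# only if the real encoding is known.
sjis_recovered = sjis_garbled.encode('latin-1').decode('shift_jis')
```

Only the first case comes out right. The other two produce strings of Latin-1 characters that have to be re-encoded and decoded with the correct codec.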
This patch replaces the previous broken approach to TOC string decoding, which used `.encode().decode('unicode_escape')`, with proper parsing of the escape sequences cdrdao is known to generate.

The new parser is also lenient with invalid escape sequences, which can occur due to improper escaping in cdrdao. See: cdrdao/cdrdao#32

Latin-1: This new parsing method should work for Latin-1 strings for both old and new versions of cdrdao, as long as those strings don't trigger the improper escaping issues in upstream cdrdao. This has been verified with the album Diorama from the Danish black metal band MØL.

MS-JIS: This new parsing method should also work for MS-JIS strings as long as the .toc file was generated by cdrdao 1.2.5+ and the strings don't trigger improper escaping issues in upstream cdrdao. Unfortunately, I don't have any CD with CD-Text in MS-JIS, so I could not verify this. cdrdao versions before 1.2.5 will still cause whipper to produce mojibake (garbled characters) when reading MS-JIS CD-Text, as those versions do not encode strings in UTF-8.

Other encodings: As far as I know, CD-Text officially supports only ASCII, Latin-1 and MS-JIS, but I wouldn't be surprised if there are unofficial encodings out there, given the strange strings I've seen in some bug reports. If you have a CD with garbled CD-Text, please submit a bug report indicating the performer, album name and language, and attach the .toc file so that the produced strings can be compared to the expected text.

Fixes whipper-team#169
Fixes whipper-team#183

Signed-off-by: Alicia Boya García <[email protected]>
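To illustrate the idea of a lenient escape parser, here is a rough sketch. It is not the code from the commit. The set of escapes handled (`\\`, `\"` and three-digit octal up to `\377`) is an assumption about what cdrdao emits, and the name `unescape_toc_string` is hypothetical. Anything that does not look like one of those escapes, including a lone trailing backslash, is passed through unchanged instead of raising.

```python
import re

# Assumed escape forms: backslash + 3 octal digits (000-377),
# or backslash + one of: double quote, backslash.
_ESCAPE_RE = re.compile(r'\\(?:(?P<octal>[0-3][0-7]{2})|(?P<char>["\\]))')


def unescape_toc_string(text):
    """Expand recognized escapes; leave anything else as-is (lenient)."""
    def replace(match):
        octal = match.group('octal')
        if octal is not None:
            # Octal escapes are treated as Latin-1 code points here.
            return chr(int(octal, 8))
        return match.group('char')

    return _ESCAPE_RE.sub(replace, text)


old_style = unescape_toc_string(r'M\330L')    # octal-escaped Latin-1
new_style = unescape_toc_string('MØL')        # already-decoded UTF-8 text
trailing = unescape_toc_string('Title\\')     # lone trailing backslash kept
quoted = unescape_toc_string(r'say \"hi\"')   # escaped quotes unwrapped
```

Because text that is already decoded is never re-encoded, strings from cdrdao 1.2.5+ pass through untouched, which matches the behavior the commit message describes for UTF-8 output.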
I have several CDs that have possibly invalid CD-Text.
Whipper will raise this error when parsing it:
I added some debug output to the TOC parsing and found that this is the value it cannot parse properly:
(The TOC contains each backslash once, not twice; the above is the printed representation of the string.)
The simple fix would be to strip trailing backslashes. I will do that now and test it.