Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CD-Text cannot be parsed, raising ValueErrors #169

Closed
MerlijnWajer opened this issue Jun 10, 2017 · 7 comments · Fixed by #633
Closed

CD-Text cannot be parsed, raising ValueErrors #169

MerlijnWajer opened this issue Jun 10, 2017 · 7 comments · Fixed by #633
Assignees
Labels
Accepted Accepted issue on our roadmap Bug Generic bug: can be used together with more specific labels Needed: discussion More discussion needed before anything can be done (or still no agreement has been reached) Needed: patch A pull request is required Priority: medium Medium priority Sprintable Small enough to sprint on
Milestone

Comments

@MerlijnWajer
Copy link
Collaborator

I have several CDs that have possibly invalid CD-Text.

Whipper will raise this error when parsing it:

u"(ValueError('Trailing \\ in string',)

I added some debug to the toc parsing, and found that this is the value that it cannot parse properly:

Read: key: u'TITLE', value: u'\267|\246\250\245\'

(The TOC contains every slash once, not twice, but this is the representation of the string printed)

The simple fix would be to strip trailing slashes. I will do that now and test it.

@MerlijnWajer
Copy link
Collaborator Author

This is code in question (image/toc.py):

value = value.decode('string-escape').decode('latin-1')

@MerlijnWajer
Copy link
Collaborator Author

This patch fixes it for me:

diff --git a/morituri/image/toc.py b/morituri/image/toc.py
index c83e940..7a5dab4 100644
--- a/morituri/image/toc.py
+++ b/morituri/image/toc.py
@@ -195,6 +195,15 @@ class TocFile(object, log.Loggable):
             if m:
                 key = m.group('key')
                 value = m.group('value')
+
+                # Occasionally, CDRDAO will contain CD-TEXT that ends with a
+                # slash. This will case value.decode('string-escape') to fail
+                # with a ValueError. We could -catch- that exception, but it
+                # might be more clean to just strip the trailing slash, as that
+                # seems to be the main/only issue right now.
+                while value.endswith('\\'):
+                    value = value[:-1]
+
                 # usually, value is encoded with octal escapes and in latin-1
                 # FIXME: other encodings are possible, does cdrdao handle
                 # them ?

@MerlijnWajer MerlijnWajer self-assigned this Jan 7, 2018
@MerlijnWajer
Copy link
Collaborator Author

Maybe we can just merge this? Seems useful and also harmless.

@JoeLametta
Copy link
Collaborator

JoeLametta commented Jan 18, 2018

Maybe the double \ is there because of something related to this??

The 12 payload bytes contain pieces of zero terminated data. When double-byte text is used the zero is a double byte, otherwise it is a single ASCII NUL.

https://www.gnu.org/software/libcdio/cd-text-format.html#Pack-Contents

@JoeLametta JoeLametta added the Bug Generic bug: can be used together with more specific labels label Mar 6, 2018
@JoeLametta
Copy link
Collaborator

JoeLametta commented Nov 3, 2018

@MerlijnWajer

Found the explanation:

CD-Text can also use the Shift JIS double byte encoding which, I think, cdrdao supports (and we're not handling in whipper). It's not allowed to mix Shift JIS and IEC 8859-1 (latin-1).

The character code is defined as follows:

$00         = ISO/IEC 8859-1 (modified, see CD EXTRA Specification, appendix 1)
$01         = ISO/IEC 646, ASCII (7 bit)
$02 .. $7F  = Reserved
$80         = Music Shift-JIS Kanji
$81         = Korean character code (to be defined)
$82         = Mandarin Chinese character code (to be defined)
$83 .. $FF  = Reserved

The character code indicates the character set used to code the character strings of the
PACKs with ID1 = $80 through $85. Other PACKs shall have character code $00 (ISO 8859-1
modified).

@JoeLametta
Copy link
Collaborator

JoeLametta commented Nov 3, 2018

What's the purpose of the string-escape decoding step?
To fix this issue, I think we should just escape the trailing backslash character (if present).

@JoeLametta JoeLametta added this to the 1.0 milestone Nov 3, 2018
@JoeLametta JoeLametta added Accepted Accepted issue on our roadmap Priority: medium Medium priority Sprintable Small enough to sprint on Needed: patch A pull request is required Needed: discussion More discussion needed before anything can be done (or still no agreement has been reached) labels Nov 13, 2018
@ntrrgc
Copy link
Contributor

ntrrgc commented Aug 25, 2024

Since cdrdao 1.2.5 (2023-02-03), by default the .toc file encodes strings in UTF-8 and backslash escape octal sequences are no longer used. The old behavior can still be accessed by passing --no-utf8 to cdrdao. However, the old behavior is undesirable because of a bug in the escape sequence generator: cdrdao/cdrdao#32

Even if that bug is eventually fixed and whipper passed --no-utf8, it must be noted that the no-utf8 mode encodes bytes, not characters. Therefore, this code in whipper is still wrong:

# usually, value is encoded with octal escapes and in latin-1
# FIXME: other encodings are possible, does cdrdao handle
# them ?
value = value.encode().decode('unicode_escape')

The 'unicode_escape' works only for ASCII and Latin-1, and only because their codepoints match exactly into Unicode codepoints. But CD-Text can also be encoded in MS-JIS (Shift-JIS) and treating those byte values as Latin-1 would produce mojibake.

ntrrgc added a commit to ntrrgc/whipper that referenced this issue Aug 27, 2024
This patch replaces the previous broken approach to TOC string decoding
that used `.encode().decode('unicode_escape')` with proper parsing of
the escape sequences cdrdao is known to generate.

The new parser is also lenient with invalid escape sequences, that can
occur due to improper escaping in cdrdao. See:
cdrdao/cdrdao#32

Latin-1:

This new parsing method should work for Latin-1 strings for both old and
new versions of cdrdao, as long as those strings don't trigger the
improper escaping issues in upstream cdrdao.

This has been verified with the album Diorama from the Danish black
metal band MØL.

MS-JIS:

This new parsing method should also work for MS-JIS strings as long as
the .toc file was generated by cdrdao 1.2.5+ and the strings don't
trigger improper escaping issues in upstream cdrdao.

Unfortunately, I don't have any CD with CD-Text in MS-JIS, so I could
not verify this.

cdrdao versions before 1.2.5 will still cause whipper to produce
mojibake (garbled characters) when reading MS-JIS CD-Text, as those
versions do not encode strings in UTF-8.

Other encodings:

As far as I know, CD-Text only supports officially ASCII, Latin-1 and
MS-JIS, but I wouldn't be surprised if there are unofficial encodings
out there, given the strange strings I've seen in some bug reports.

If you have a CD with garbled CD-Text, please submit a bug report
indicating the performer, album name, language and attach the .toc file
so that the produced strings can be compared to the expected text.

Fixes whipper-team#169

Fixes whipper-team#183

Signed-off-by: Alicia Boya García <[email protected]>
ntrrgc added a commit to ntrrgc/whipper that referenced this issue Aug 27, 2024
This patch replaces the previous broken approach to TOC string decoding
that used `.encode().decode('unicode_escape')` with proper parsing of
the escape sequences cdrdao is known to generate.

The new parser is also lenient with invalid escape sequences, that can
occur due to improper escaping in cdrdao. See:
cdrdao/cdrdao#32

Latin-1:

This new parsing method should work for Latin-1 strings for both old and
new versions of cdrdao, as long as those strings don't trigger the
improper escaping issues in upstream cdrdao.

This has been verified with the album Diorama from the Danish black
metal band MØL.

MS-JIS:

This new parsing method should also work for MS-JIS strings as long as
the .toc file was generated by cdrdao 1.2.5+ and the strings don't
trigger improper escaping issues in upstream cdrdao.

Unfortunately, I don't have any CD with CD-Text in MS-JIS, so I could
not verify this.

cdrdao versions before 1.2.5 will still cause whipper to produce
mojibake (garbled characters) when reading MS-JIS CD-Text, as those
versions do not encode strings in UTF-8.

Other encodings:

As far as I know, CD-Text only supports officially ASCII, Latin-1 and
MS-JIS, but I wouldn't be surprised if there are unofficial encodings
out there, given the strange strings I've seen in some bug reports.

If you have a CD with garbled CD-Text, please submit a bug report
indicating the performer, album name, language and attach the .toc file
so that the produced strings can be compared to the expected text.

Fixes whipper-team#169

Fixes whipper-team#183

Signed-off-by: Alicia Boya García <[email protected]>
ntrrgc added a commit to ntrrgc/whipper that referenced this issue Aug 27, 2024
This patch replaces the previous broken approach to TOC string decoding
that used `.encode().decode('unicode_escape')` with proper parsing of
the escape sequences cdrdao is known to generate.

The new parser is also lenient with invalid escape sequences, that can
occur due to improper escaping in cdrdao. See:
cdrdao/cdrdao#32

Latin-1:

This new parsing method should work for Latin-1 strings for both old and
new versions of cdrdao, as long as those strings don't trigger the
improper escaping issues in upstream cdrdao.

This has been verified with the album Diorama from the Danish black
metal band MØL.

MS-JIS:

This new parsing method should also work for MS-JIS strings as long as
the .toc file was generated by cdrdao 1.2.5+ and the strings don't
trigger improper escaping issues in upstream cdrdao.

Unfortunately, I don't have any CD with CD-Text in MS-JIS, so I could
not verify this.

cdrdao versions before 1.2.5 will still cause whipper to produce
mojibake (garbled characters) when reading MS-JIS CD-Text, as those
versions do not encode strings in UTF-8.

Other encodings:

As far as I know, CD-Text only supports officially ASCII, Latin-1 and
MS-JIS, but I wouldn't be surprised if there are unofficial encodings
out there, given the strange strings I've seen in some bug reports.

If you have a CD with garbled CD-Text, please submit a bug report
indicating the performer, album name, language and attach the .toc file
so that the produced strings can be compared to the expected text.

Fixes whipper-team#169

Signed-off-by: Alicia Boya García <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Accepted Accepted issue on our roadmap Bug Generic bug: can be used together with more specific labels Needed: discussion More discussion needed before anything can be done (or still no agreement has been reached) Needed: patch A pull request is required Priority: medium Medium priority Sprintable Small enough to sprint on
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants