Hard-coded endianness in TagDescriptor.GetEncodedTextDescription #313
Comments
@reinfallt can you share an image that fails to decode due to this? In cases where the spec is under-specified (or when common deviations from the spec occur in practice), we have two options:
However, because this is a general issue, we have the If you find a reasonable heuristic for such cases, it would be great if you could share it.
Here is a picture with a little-endian UTF-16 User Comment tag: Exif SubIFD - User Comment = ?????? Should be:
Thanks for the sample image. The Java library decodes this correctly by default, whereas the .NET one does not. We should find the implementation difference between the two.
Currently Java interprets Changing .NET to match Java:
Java's text decoding appears to gracefully handle inverted Unicode, as Java handles both of these files in the same way. We may be able to replicate that here. Those files specify
A product I am using hits this same bug as well, so I want to resurface it. It seems to me that there may not be any ambiguity in the specification. I am not a Java developer (long-time .NET, and C++ before that), but in my experience, Here are a couple of test images I've been working with. You'll notice that both have identical metadata, one with big-endian and one with little-endian. (The metadata for both was embedded using Here is a unit test I wrote that tests both images; it works for big-endian and fails for little-endian:
P.S. Also, I have never heard of a "reliable heuristic" for determining the "best fit" for endianness. Indeed, in many cases the same string of bytes can be interpreted as either UTF-16BE or UTF-16LE. I suppose if you see a long stretch of alternating zero and non-zero bytes, one could hazard a guess, but that would still feel like an unsafe idea to me. There have been so many bugs over the years, in all layers of OS and software, that were caused by bad assumptions about Unicode encoding.

P.P.S. In the meantime, I guess I'll see whether I can patch the other product I mentioned so that it uses the raw StringValue and does the decoding itself, as you suggested above. (But that seems like a hack that doesn't benefit any of your other users.) Or I could redo my entire workflow to force everything to be tagged big-endian before the files ever reach this point, but that's the worst of all solutions.

I am definitely curious about the situations where "experience shows" that using the endianness declared by II or MM would fail.
I guess my problem is, once the metadata has been read, there seems to be no way for the software (which is using metadata extractor) to detect what has happened. For example, I use metadata extractor to read the metadata for a file, and it simply tells me the UserComment is It would be nice if the Apologies if I'm missing something and there is a workaround I am not seeing!
After studying the code some more, I struck out on the fixes and workarounds I had in mind. The code for reading and initializing directory structures looks pretty particular, and I did not relish the thought of changing anything in order to get access to the byte order information in more places (although in my opinion, that would still be the best solution).

Obviously, I was initially dismissive of the idea of a heuristic to detect endianness. However, having read a little, I realized I was conflating two problems: a general-purpose heuristic for detecting encoding, and a specific heuristic that only tells UTF-16BE from UTF-16LE. Several approaches work fairly reliably for the latter (though there are probably still cases they get wrong). I searched for a few minutes and didn't see an obvious one online, so I ended up implementing one myself. It's a little brute-force, but it works pretty well in my testing, both for the unit test I submitted above and in the third-party software through which I became acquainted with metadata-extractor.

I just added a couple of private static methods on class TagDescriptor that implement detection of little-endian text, and changed a couple of lines to invoke that code when necessary. I would be happy to submit a pull request if you're interested. At TagDescriptor.cs line 363:
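The code originally posted with this comment is not reproduced here. As a rough illustration of the kind of UTF-16BE versus UTF-16LE heuristic described, and not the commenter's actual implementation, the Python sketch below compares how many zero bytes fall on even versus odd offsets. The function name `guess_utf16_little_endian` is invented for this example, and the approach assumes mostly Latin-range text, where one byte of each code unit is zero.

```python
from typing import Optional


def guess_utf16_little_endian(payload: bytes) -> Optional[bool]:
    """Guess the byte order of UTF-16 text that has no BOM.

    Hypothetical helper, not part of MetadataExtractor.
    Returns True for little-endian, False for big-endian,
    and None when the bytes give no usable signal.
    """
    # For Latin-range text the high byte of each code unit is zero.
    # Little-endian stores it second (odd offsets),
    # big-endian stores it first (even offsets).
    even_zeros = sum(1 for b in payload[0::2] if b == 0)
    odd_zeros = sum(1 for b in payload[1::2] if b == 0)

    if odd_zeros > even_zeros:
        return True
    if even_zeros > odd_zeros:
        return False
    return None  # e.g. CJK text, which often contains no zero bytes at all
```

A caller would still need a fallback for the `None` case, for example the byte order declared by the file, which is why this remains a guess rather than a guarantee.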
It has been a while, but I remember looking at moving the decoding of the tags that use the GetEncodedTextDescription method from the descriptors to the CustomProcessTag method in the ExifTiffHandler:
It would be a fairly simple fix. These tags currently use the GetEncodedTextDescription method:
GetEncodedTextDescription uses hard-coded big-endian for Unicode (UTF-16) and little-endian for UTF-32.
As far as I can tell, the endianness of tags like the UserComment tag seems to depend on the endianness of the file as a whole. I added a UserComment tag using Digikam in a little-endian JPEG file and the tag was encoded as little-endian Unicode.
The raw data of the tag:
55 4E 49 43 4F 44 45 00 48 00 E5 00 72 00 69 00 67 00 20 00 73 00 61 00 6B 00
It makes sense, since the spec doesn't mention any endianness specifically for this tag. I don't know about UTF-32 since it is not officially in the spec.
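To make the effect concrete, here is a small Python sketch (Python is used only for brevity) that decodes the raw bytes quoted above both ways. The helper names `is_little_endian` and `decode_user_comment` are hypothetical and are not part of MetadataExtractor. The sketch assumes the byte order is taken from the TIFF header marker (`II` for little-endian, `MM` for big-endian) and that the tag begins with the 8-byte `UNICODE\0` character code.

```python
# Raw UserComment bytes as quoted in this issue.
RAW = bytes.fromhex(
    "55 4E 49 43 4F 44 45 00 "  # "UNICODE\0" character-code prefix
    "48 00 E5 00 72 00 69 00 67 00 20 00 73 00 61 00 6B 00"
)


def is_little_endian(tiff_header: bytes) -> bool:
    """Read the byte order from the first two bytes of a TIFF header."""
    marker = tiff_header[:2]
    if marker == b"II":  # "Intel" order: little-endian
        return True
    if marker == b"MM":  # "Motorola" order: big-endian
        return False
    raise ValueError("not a TIFF byte-order marker: %r" % marker)


def decode_user_comment(raw: bytes, little_endian: bool) -> str:
    """Decode a UNICODE-coded UserComment using the given byte order."""
    code, payload = raw[:8], raw[8:]
    if code != b"UNICODE\x00":
        raise ValueError("this sketch only handles the UNICODE character code")
    return payload.decode("utf-16-le" if little_endian else "utf-16-be")


# Byte order taken from the file, as proposed in this issue.
print(decode_user_comment(RAW, is_little_endian(b"II*\x00")))
# Byte order forced to big-endian, as in the hard-coded behaviour.
print(decode_user_comment(RAW, False))
```

With the file's byte order the payload decodes to `Hårig sak`. Forcing big-endian produces unrelated characters without raising an error, so the wrong choice fails silently.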