Working with metadata output from multiple library implementations #20
Comments
My initial thoughts are to give each implementation its own branch.
Pros
Cons
Example
I demonstrated this on my fork of the repo
|
Hey Payton, I like this idea. There's only one aspect I'd miss, which is that currently the metadata files contain the 'best' output we've seen for each tag so far. It's possible that neither implementation can currently produce that output for whatever reason, but the current metadata files show what we have at some point seen to be possible, and what we should aspire towards.
Another feature I'd like a solution to have is the ability to ignore certain differences at the file/tag level. For example, some obviously garbled text differs, but there's no need to fix such differences as the output is already rubbish:
diff --git a/jpg/metadata/samsung gt-i9300 (galaxy siii).jpg.txt b/jpg/metadata/samsung gt-i9300 (galaxy siii).jpg.txt
index 9231b47..58efc50 100644
--- a/jpg/metadata/samsung gt-i9300 (galaxy siii).jpg.txt
+++ b/jpg/metadata/samsung gt-i9300 (galaxy siii).jpg.txt
@@ -38,10 +38,10 @@ TYPE: JPEG
[Exif SubIFD - 0x9207] Metering Mode = Center weighted average
[Exif SubIFD - 0x9209] Flash = Flash did not fire
[Exif SubIFD - 0x920a] Focal Length = 3.7 mm
-[Exif SubIFD - 0x9286] User Comment = �?�;
+[Exif SubIFD - 0x9286] User Comment = ��;
[Exif SubIFD - 0xa000] FlashPix Version = 1.00
[Exif SubIFD - 0xa001] Color Space = sRGB
[Exif SubIFD - 0xa002] Exif Image Width = 3264 pixels
To achieve this we'd probably need custom tooling. Possibly a richer file format. I'm not sure it's worth it.
I suppose a third branch could be used for the 'best' output. I don't know how to apply selective patches between branches, but I'm sure it's possible. An opportunity to learn more git :)
So far your approach is the best idea, in terms of bang-for-the-buck 👍 |
Many of the less relevant differences are because of the garbled text. What if there was a comparison output mode that converts text byte arrays to Base64 strings? This removes encoding as a problem area. |
What if the 'best' output is created using a third-party tool, like ExifTool, and we store that as the metadata standard? We know that is about as comprehensive as it gets and unrelated to MetadataExtractor, so community changes won't mess with the ideal. Maybe it's possible to have a special output that mimics the output from ExifTool as much as possible for change detection. I'm afraid if either Java or .NET is favored at all, there may be few alternatives to what we're doing now. This doesn't address your other issues but might be a thing to consider early. |
Garbled text: can we always convert to the same text encoding (regardless of the encoding stored)? Converting to Base64 strings makes it hard to read what the correct output should be, but it is good for comparing byte arrays. Also, for verifying changes, if you want to ignore selected tags or files, we can add a filter to the output based on file name and/or tag. |
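A minimal sketch of such a file/tag filter, in Python. The rule format (file-name glob, directory name, tag ID) and the names `IGNORE_RULES` and `is_ignored` are hypothetical and not part of any existing tooling. It assumes output lines in the `[Directory - 0xNNNN] Name = Value` form shown in the diff earlier in this thread.

```python
import fnmatch
import re

# Hypothetical ignore rules: (file-name glob, directory name, tag ID).
IGNORE_RULES = [
    ("samsung gt-i9300*.jpg.txt", "Exif SubIFD", 0x9286),  # garbled User Comment
]

# Matches lines such as "[Exif SubIFD - 0x9286] User Comment = ..."
TAG_LINE = re.compile(r"^\[(?P<dir>.+?) - 0x(?P<id>[0-9a-fA-F]+)\] ")


def is_ignored(file_name: str, line: str, rules=IGNORE_RULES) -> bool:
    """Return True if a differing output line matches an ignore rule."""
    match = TAG_LINE.match(line)
    if match is None:
        return False  # not a tag line (e.g. "TYPE: JPEG"), never ignored
    directory = match.group("dir")
    tag_id = int(match.group("id"), 16)
    return any(
        fnmatch.fnmatch(file_name, pattern)
        and directory == rule_dir
        and tag_id == rule_id
        for pattern, rule_dir, rule_id in rules
    )
```

A comparison step could call `is_ignored` on each changed line and report only the lines for which it returns `False`.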
Since it's only for long-term comparisons, maybe both Base64 and some common encoding like UTF-8 would be a good idea. |
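One way the combined output could look, sketched in Python. The function name `render_value` and the `base64:... utf8:...` layout are assumptions made for illustration; neither library produces this format today.

```python
import base64


def render_value(raw: bytes) -> str:
    """Render raw tag bytes as Base64 plus a best-effort UTF-8 decoding."""
    b64 = base64.b64encode(raw).decode("ascii")
    # errors="replace" substitutes U+FFFD for undecodable bytes, so the text
    # part is for human reading only; the Base64 part is what gets compared.
    text = raw.decode("utf-8", errors="replace")
    return f"base64:{b64} utf8:{text!r}"
```

Because the Base64 part depends only on the bytes, two implementations that extract the same bytes would produce the same comparison key regardless of how each decodes them.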
I'm stepping back from the idea of having the "best output we've seen for each tag so far" as this is too hard to maintain automatically. If there's anything critical here, we can file issues against the implementations with details.
I'm hoping we can do all this in one branch somehow. It'll involve a lot less tooling and ceremony, and make all content browsable directly on GitHub.
It'd be good to differentiate between differences in how the implementations extract string bytes versus differences in how Java and .NET decode strings from those bytes. The former is much worse. We could consider having binary metadata files that store raw output, and comparison code that identifies whether the difference is due to the extraction (i.e. the bytes differ) or the decoding (i.e. the strings differ). We could also capture what encoding the library used per-string. I'll provide more context on this idea below.
I've been thinking about how to get more use out of the metadata files in the image library repo and am leaning towards the following.
This is just a dump of ideas. These should be divided up into milestones to make it easier to achieve. What do you think? |
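The extraction-versus-decoding distinction drawn in this comment could be checked along these lines. This is a Python sketch only: the `StringValue` record and the `classify` function are hypothetical, and it assumes each implementation can report the raw bytes and the encoding it used for each string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StringValue:
    """One string tag as reported by one implementation (hypothetical record)."""
    raw: bytes      # bytes the library extracted from the file
    encoding: str   # encoding the library chose to decode them
    text: str       # resulting decoded string


def classify(java: StringValue, dotnet: StringValue) -> str:
    """Classify how two implementations' outputs for the same tag relate."""
    if java.raw != dotnet.raw:
        return "extraction"   # the bytes differ: the more serious case
    if java.text != dotnet.text:
        return "decoding"     # same bytes, different strings
    return "same"
```

Checking the bytes first means a decoding difference is only reported once extraction is known to agree.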
Note: UTF-8 text can be in several normalization forms, but it is most likely the best form for text.
If needed, I can elaborate more on this.
Converting from one character set to another must be lossless, or else be flagged as an error / loss of data.
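A small Python sketch of a conversion that fails loudly instead of losing data. The function name `transcode` is hypothetical. The round-trip check assumes the source encoding has a single canonical byte form, which does not hold for every encoding (for example, encodings with an optional byte order mark).

```python
def transcode(raw: bytes, source: str, target: str = "utf-8") -> bytes:
    """Convert bytes between character sets, raising instead of losing data."""
    # Strict error handling raises UnicodeDecodeError / UnicodeEncodeError
    # rather than silently substituting replacement characters.
    text = raw.decode(source, errors="strict")
    converted = text.encode(target, errors="strict")
    # Round-trip check: converting back must reproduce the original bytes.
    if converted.decode(target).encode(source) != raw:
        raise ValueError(f"lossy conversion from {source} to {target}")
    return converted
```

Both Unicode error types are subclasses of `ValueError`, so a caller can catch `ValueError` to flag any lossy or invalid conversion.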
… On Jul 15, 2019, at 6:17 AM, Drew Noakes ***@***.***> wrote:
I've been thinking about how to get more use out of the metadata files in the image library repo and am leaning towards the following.
Each image type (eg. jpeg) would have the following folder structure:
jpeg
- metadata
- dotnet
- *.txt
- java
- *.txt
The output from each library implementation would be separated into different folders.
Rather than .txt files we could write .md files for nicer presentation, including showing the image inline for formats that have browser support (JPEG, PNG, BMP, GIF, ...).
In each metadata folder would be java-dotnet.diff and dotnet-java.diff files containing a list of any output files that differ between implementations.
In each metadata folder could be a generated README.md file, listing summary information about every file in that folder. Common data includes file size, number of directories, number of tags, number of errors. All of those (except size) potentially differ by implementation.
The code that generates these files would move into the image library repo itself, with the ability to launch both the dotnet and java versions of the library on the local filesystem (perhaps a sibling folder, or a submodule).
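The generated README.md idea could start from something like the following Python sketch. The names `summarize` and `readme_table` are hypothetical. It assumes tag lines in the `[Directory - 0xNNNN] Name = Value` form seen earlier in this thread, counts directories by distinct name, and leaves out file size and error counts because their representation in the output files is not shown here.

```python
import re

# Matches tag lines such as "[Exif SubIFD - 0x9207] Metering Mode = ..."
TAG_LINE = re.compile(r"^\[(?P<dir>.+?) - 0x[0-9a-fA-F]+\] .+? = ")


def summarize(text: str) -> tuple[int, int]:
    """Return (directory count, tag count) for one metadata output file."""
    directories = set()
    tags = 0
    for line in text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            directories.add(match.group("dir"))
            tags += 1
    return len(directories), tags


def readme_table(outputs: dict[str, str]) -> str:
    """Build a Markdown summary table from {file name: output text}."""
    rows = ["| File | Directories | Tags |", "| --- | --- | --- |"]
    for name in sorted(outputs):
        dirs, tags = summarize(outputs[name])
        rows.append(f"| {name} | {dirs} | {tags} |")
    return "\n".join(rows)
```

Running this once per implementation folder would give one table for `dotnet` and one for `java`, making differences in directory and tag counts visible at a glance.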
This is just a dump of ideas. These should be divided up into milestones to make it easier to achieve.
What do you think?
|
Another possible outcome of such a change to the metadata handling here would be to produce a diff in output between released versions of the library (and |
I agree: being able to validate extraction of the blob and its translation independently is good for unit testing.
… On Jul 15, 2019, at 6:17 AM, Drew Noakes ***@***.***> wrote:
It'd be good to differentiate between differences in how the implementations extract string bytes versus differences in how Java and .NET decode strings from those bytes.
|
I just pushed a utility program's source to this repo that allows updating both the Java and .NET outputs simultaneously. The source is here (for now at least): https://github.com/drewnoakes/metadata-extractor-images/tree/master/src/dotnet It also produces and updates a set of 'diff' files between these, in cases where files differ. You can view an initial set of these in f961c13. These diff files highlight things such as differing support between libraries, or output formatting differences. The number of diffs should ideally trend toward zero. |
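For readers who want a feel for the comparison step, here is a rough Python sketch. It is not the utility linked above. The function name `differing_files` is hypothetical, and it assumes the per-implementation `java` and `dotnet` folders of `*.txt` files proposed earlier in this thread.

```python
from pathlib import Path


def differing_files(java_dir: Path, dotnet_dir: Path) -> list[str]:
    """Return names of output files that differ or exist on only one side."""
    java = {p.name: p for p in java_dir.glob("*.txt")}
    dotnet = {p.name: p for p in dotnet_dir.glob("*.txt")}
    result = []
    for name in sorted(java.keys() | dotnet.keys()):
        if name not in java or name not in dotnet:
            result.append(name)   # produced by only one implementation
        elif java[name].read_bytes() != dotnet[name].read_bytes():
            result.append(name)   # both exist but the contents differ
    return result
```

The length of the returned list is the number that should trend toward zero as the implementations converge.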
Even for the .NET library, output varies depending upon target framework. For example net45 and net48 produce different output. |
@drewnoakes Is there anything common with the differences or is it all over the place? |
Sample files and code would be useful
… On Jan 28, 2020, at 8:02 AM, Drew Noakes ***@***.***> wrote:
Even for the .NET library, output varies depending upon target framework. For example net45 and net48 produce different output.
|
@kwhopper there are definitely themes in the diffs. Picking up some of them shouldn't be too hard.
@flemingm this is not the repo for that. Each library has its own sample code. If you'd like to see more or something different, please provide specifics. |
Background
When there was only the Java implementation of metadata-extractor, the metadata subfolders in this repository contained the faithful output of that library for each of the images found in this repository.
When the .NET version was first introduced, it came very close to producing the same output (bar about 5 encoding-related differences), which wasn't too hard to manage.
Since then, both the .NET and Java implementations have seen community contributions that have given them increasingly different capabilities to extract additional kinds of data. The libraries thrive from such contributions, and we often see fixes and features ported between them.
Problem
The different capabilities of the .NET and Java implementations make the contents of the metadata folders confusing. They don't actually represent the faithful output of either library. Contributors have pointed this out a few times. I've also wondered if there's a better way of handling this.
Currently the files contain the "best" extraction output I've seen to date (even if neither library quite produces that output any more). I've treated them as containing the ideal output that both implementations might strive to achieve.
Question
I'd like to hear suggestions on how we might better manage this cached output data.
Requirements would be:
Would like to gather thoughts and suggestions before moving ahead with anything. Please chime in if you have any ideas.