Working with metadata output from multiple library implementations #20

drewnoakes · 2018-02-12T20:20:10Z

Background

When there was only the Java implementation of metadata-extractor, the metadata subfolders in this repository contained the faithful output of that library for each of the images found in this repository.

When the .NET version was first introduced, it came very close to producing the same output (bar about 5 encoding-related differences), which wasn't too hard to manage.

Since then, both the .NET and Java implementations have seen community contributions that have given them increasingly different capabilities to extract additional kinds of data. The libraries thrive from such contributions, and we often see fixes and features ported between them.

Problem

The differing capabilities of the .NET and Java implementations make the contents of the metadata folders confusing. They don't actually represent the faithful output of either library. Contributors have pointed this out a few times. I've also wondered if there's a better way of handling this.

Currently the files contain the "best" extraction output I've seen to date (even if neither library quite produces that output any more). I've treated them as containing the ideal output that both implementations might strive to achieve.

Question

I'd like to hear suggestions on how we might better manage this cached output data.

Requirements would be:

Ability to easily run a regression test for either of the implementations
Ability to easily identify differences between implementations

Would like to gather thoughts and suggestions before moving ahead with anything. Please chime in if you have any ideas.

payton · 2018-03-16T22:51:40Z

My initial thoughts are to give each implementation its own branch.

master: responsible for testing data, but metadata folders are all empty
java: output metadata as normal, pulling data updates from master if necessary
dotnet: output metadata as normal, pulling data updates from master if necessary

Pros

Clear separation between implementation outputs
Easy comparison between outputs (git diff java..dotnet -- metadata.txt)
Doesn't require modification of ProcessAllImagesInFolderUtility

Cons

Some maintenance required if updating master with images (each branch must pull new images)

Example

I demonstrated this on my fork of the repo

git diff java..dotnet -- ico/metadata/Icon\ \(1\).ico.txt

diff --git a/ico/metadata/Icon (1).ico.txt b/ico/metadata/Icon (1).ico.txt
index a97d303..3e30669 100644
--- a/ico/metadata/Icon (1).ico.txt     
+++ b/ico/metadata/Icon (1).ico.txt     
@@ -2,8 +2,8 @@ FILE: Icon (1).ico
 TYPE: ICO
 
 [ICO - 0x0001] Image Type = Icon
-[ICO - 0x0002] Image Width = 48 pixels
-[ICO - 0x0003] Image Height = 48 pixels
+[ICO - 0x0002] Image Width = 1 pixels
+[ICO - 0x0003] Image Height = 1 pixels
 [ICO - 0x0004] Colour Palette Size = 2 colours
 [ICO - 0x0005] Colour Planes = 1
 [ICO - 0x0007] Bits Per Pixel = 1

drewnoakes · 2018-03-19T16:26:51Z

Hey Payton,

I like this idea. There's only one aspect I'd miss, which is that the metadata files currently contain the 'best' output we've seen for each tag so far. It's possible that neither implementation can currently produce that output for whatever reason, but the current metadata files show what we have at some point seen to be possible, and what we should aspire towards.

Another feature I'd like a solution to have is the ability to ignore certain differences at the file/tag level. For example, some obviously garbled text differs, but there's no need to fix such differences as the output is already rubbish:

diff --git a/jpg/metadata/samsung gt-i9300 (galaxy siii).jpg.txt b/jpg/metadata/samsung gt-i9300 (galaxy siii).jpg.txt
index 9231b47..58efc50 100644
--- a/jpg/metadata/samsung gt-i9300 (galaxy siii).jpg.txt	
+++ b/jpg/metadata/samsung gt-i9300 (galaxy siii).jpg.txt	
@@ -38,10 +38,10 @@ TYPE: JPEG
 [Exif SubIFD - 0x9207] Metering Mode = Center weighted average
 [Exif SubIFD - 0x9209] Flash = Flash did not fire
 [Exif SubIFD - 0x920a] Focal Length = 3.7 mm
-[Exif SubIFD - 0x9286] User Comment = �?�;
+[Exif SubIFD - 0x9286] User Comment = ��;
 [Exif SubIFD - 0xa000] FlashPix Version = 1.00
 [Exif SubIFD - 0xa001] Color Space = sRGB
 [Exif SubIFD - 0xa002] Exif Image Width = 3264 pixels

To achieve this we'd probably need custom tooling. Possibly a richer file format. I'm not sure it's worth it.

I suppose a third branch could be used for the 'best' output. I don't know how to apply selective patches between branches, but I'm sure it's possible. An opportunity to learn more git :)

So far your approach is the best idea, in terms of bang-for-the-buck 👍

kwhopper · 2018-03-20T20:41:42Z

Many of the less relevant differences are because of the garbled text. What if there was a comparison output mode that converts text byte arrays to Base64 strings? This removes encoding as a problem area.
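A minimal sketch of what such a comparison mode might look like, written in Python for brevity rather than in Java or .NET. The function name `format_tag_for_comparison` and the `base64:` prefix are hypothetical and are not part of either library.

```python
import base64


def format_tag_for_comparison(tag_name: str, raw: bytes) -> str:
    """Render a tag's raw bytes as Base64 (hypothetical output format).

    The resulting line depends only on the bytes, not on how a given
    runtime would decode them into text.
    """
    encoded = base64.b64encode(raw).decode("ascii")
    return f"{tag_name} = base64:{encoded}"


# Identical bytes always produce an identical line, whatever encoding
# the text was originally meant to be in.
print(format_tag_for_comparison("User Comment", b"\xff\xfe;"))
```

With output like this, two implementations would only differ on a line if they extracted different bytes.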

kwhopper · 2018-04-11T21:15:19Z

What if the 'best' output is created using a third-party tool, like ExifTool, and we store that as the metadata standard? We know that is about as comprehensive as it gets and unrelated to MetadataExtractor, so community changes won't mess with the ideal. Maybe it's possible to have a special output that mimics the output from ExifTool as much as possible for change detection.

I'm afraid if either Java or .NET is favored at all, there may be few alternatives to what we're doing now. This doesn't address your other issues but might be a thing to consider early.

flemingm · 2018-05-10T03:21:54Z

Garbled text: can we always convert to the same text encoding (regardless of the encoding stored)?
Converting to UTF-8 would be my preference.

Converting to Base64 strings makes it hard to read what the correct output should be, but it is good for comparing byte arrays.

Also, for verification of changes, if you want to ignore selected tags or files, we can add a filter to the output based on file name and/or tag.
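One way such a filter could work is sketched below in Python. Everything here is an assumption for illustration: the rule format, the names `IGNORE_RULES`, `is_ignored` and `differing_lines`, and the simplification that the two outputs are compared line by line in the same order.

```python
from fnmatch import fnmatchcase

# Hypothetical ignore rules: (file name pattern, tag pattern).
IGNORE_RULES = [
    ("samsung gt-i9300*.jpg.txt", "*User Comment*"),
]


def is_ignored(file_name, line, rules=IGNORE_RULES):
    """Return True if an output line matches one of the ignore rules."""
    # Everything before " = " identifies the directory and tag.
    tag_part = line.split(" = ", 1)[0]
    return any(
        fnmatchcase(file_name, file_pattern) and fnmatchcase(tag_part, tag_pattern)
        for file_pattern, tag_pattern in rules
    )


def differing_lines(file_name, java_lines, dotnet_lines, rules=IGNORE_RULES):
    """Pair lines positionally; keep pairs that differ and are not ignored."""
    return [
        (j, d)
        for j, d in zip(java_lines, dotnet_lines)
        if j != d and not is_ignored(file_name, j, rules)
    ]
```

Positional pairing is the weakest part of this sketch, since one implementation may emit tags the other lacks. A real tool would more likely key lines by directory and tag ID.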

kwhopper · 2018-05-10T14:21:01Z

Since it's only for long-term comparisons, maybe both Base64 and some common encoding like UTF-8 would be a good idea.

drewnoakes · 2019-07-15T10:17:55Z

> ...one aspect I'd miss which is that currently the metadata files contain the 'best' output we've seen for each tag so far. It's possible that neither implementation can currently produce that output for whatever reason, but the current metadata files shows what we have at some point seen to be possible, and what we should aspire towards.

I'm stepping back from the idea of having the "best output we've seen for each tag so far" as this is too hard to maintain automatically. If there's anything critical here, we can file issues against the implementations with details.

> My initial thoughts are to give each implementation its own branch

I'm hoping we can do all this in one branch somehow. It'll involve a lot less tooling and ceremony, and make all content browsable directly on GitHub.

> maybe both Base64 and some common encoding like UTF-8 would be a good idea

It'd be good to differentiate between differences in how the implementations extract string bytes versus differences in how Java and .NET decode strings from those bytes. The former is much worse. We could consider having binary metadata files that store raw output, and comparison code that identifies whether the difference is due to the extraction (i.e. the bytes differ) or the decoding (i.e. the strings differ). We could also capture what encoding the library used per-string. I'll provide more context on this idea below.
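The distinction between the two kinds of difference could be expressed roughly as follows. This is a sketch in Python under the assumption that both the raw bytes and the decoded string are available for each tag from each implementation. `DiffKind` and `classify` are hypothetical names.

```python
from enum import Enum


class DiffKind(Enum):
    SAME = "same"
    EXTRACTION = "extraction"  # the raw bytes differ (the worse case)
    DECODING = "decoding"      # same bytes, but the decoded strings differ


def classify(java_bytes: bytes, java_text: str,
             dotnet_bytes: bytes, dotnet_text: str) -> DiffKind:
    """Classify a per-tag difference between the two implementations."""
    if java_bytes != dotnet_bytes:
        return DiffKind.EXTRACTION
    if java_text != dotnet_text:
        return DiffKind.DECODING
    return DiffKind.SAME


# Same byte, decoded two different ways: a decoding difference only.
raw = b"\xe9"
print(classify(raw, raw.decode("latin-1"),
               raw, raw.decode("utf-8", errors="replace")))
```

Checking the bytes first means a decoding difference is only reported when extraction is known to agree.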

I've been thinking about how to get more use out of the metadata files in the image library repo and am leaning towards the following.

Each image type (e.g. jpeg) would have the following folder structure:
```
jpeg
- metadata
  - dotnet
    - *.txt
  - java
    - *.txt
```
The output from each library implementation would be separated into different folders.
Rather than .txt files we could write .md files for nicer presentation, including showing the image inline for formats that have browser support (JPEG, PNG, BMP, GIF, ...).
In each metadata folder would be java-dotnet.diff and dotnet-java.diff files containing a list of any output files that differ between implementations.
In each metadata folder could be a generated README.md file, listing summary information about every file in that folder. Common data includes file size, number of directories, number of tags, number of errors. All of those (except size) potentially differ by implementation.
The code that generates these files would move into the image library repo itself, with the ability to launch both the dotnet and java versions of the library on the local filesystem (perhaps a sibling folder, or a submodule).
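As a rough illustration of the "list of output files that differ" idea, the following Python sketch compares the `java` and `dotnet` subfolders of one `metadata` folder. The function name `differing_files` is hypothetical, and the byte-for-byte comparison is an assumption; actual tooling may well compare differently.

```python
from pathlib import Path


def differing_files(metadata_dir: Path) -> list[str]:
    """Names of output files that differ between metadata/java and
    metadata/dotnet, including files present in only one of them."""
    java = {p.name: p for p in (metadata_dir / "java").glob("*.txt")}
    dotnet = {p.name: p for p in (metadata_dir / "dotnet").glob("*.txt")}
    names = sorted(java.keys() | dotnet.keys())
    return [
        name for name in names
        if name not in java
        or name not in dotnet
        or java[name].read_bytes() != dotnet[name].read_bytes()
    ]
```

Calling `differing_files(Path("jpeg/metadata"))` would return the file names to list in a per-folder diff summary.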

This is just a dump of ideas. These should be divided up into milestones to make them easier to achieve.

What do you think?

flemingm · 2019-07-15T21:41:58Z

Note: UTF-8 has several normalized forms, but is most likely the best form for text. If needed, I can elaborate more on this. Converting from one character set to another must be lossless, or else be flagged as an error / loss of data.

drewnoakes · 2019-07-18T00:18:05Z

Another possible outcome of such a change to the metadata handling here would be to produce a diff in output between released versions of the library (and master to the last release). That would potentially be quite useful.

flemingm · 2019-07-18T17:31:12Z

I agree: being able to validate the extraction of the blob and its translation independently is good for unit testing.

drewnoakes · 2020-01-27T12:06:45Z

dotnet and java output directories have existed for a while now.

I just pushed a utility program's source to this repo that allows updating both the Java and .NET outputs simultaneously. The source is here (for now at least):

https://github.com/drewnoakes/metadata-extractor-images/tree/master/src/dotnet

It also produces and updates a set of 'diff' files between these, in cases where files differ.

You can view an initial set of these in f961c13.

These diff files highlight things such as differing support between libraries, or output formatting differences. The number of diffs should ideally trend toward zero.

drewnoakes · 2020-01-28T13:01:55Z

Even for the .NET library, output varies depending upon target framework. For example net45 and net48 produce different output.

kwhopper · 2020-01-28T15:05:32Z

@drewnoakes Is there anything common with the differences or is it all over the place?

flemingm · 2020-01-28T21:15:10Z

Sample files and code would be useful

drewnoakes · 2020-01-30T02:08:13Z

> @drewnoakes Is there anything common with the differences or is it all over the place?

@kwhopper there are definitely themes in the diffs. Picking up some of them shouldn't be too hard.

> Sample files and code would be useful

@flemingm this is not the repo for that. Each library has its own sample code. If you'd like to see more or something different, please provide specifics.

This was referenced Jul 22, 2019

Generate metadata files under a 'java' folder drewnoakes/metadata-extractor#418

Merged

Generate metadata files under a 'dotnet' folder drewnoakes/metadata-extractor-dotnet#187

Merged

drewnoakes mentioned this issue May 11, 2020

Unit tests for images from metadata-extractor-images #34

Open
