Please post here if you can run Meow on a large dataset! #7
Hello! In the end I found 13 collisions. There's some relevant information here but overall it'll be useless without the files themselves (see bottom for more information about them):
I'll try to contact owners of those files to see if they're okay with the files being shared here for research purposes. |
Stupid question @aveao but did you confirm the files were actually different? Is that what the following lines are? |
@Qix- They are different, yes. The filenames are their SHA256 hashes. |
@aveao That would be very helpful. Although I suspect we do not actually need the files if you can't get them, because honestly we probably won't update the main loop of Meow hash, so it would only matter if the output of the main loop was different but the hash was not. So maybe as a middle ground for now, could you send the complete values of S0123, S4567, S89AB, SCDEF as they are at the end of the function? If they are the same for the collisions, they are probably unfixable collisions. But if they are different, then it's our mixdown that is at fault and we can probably improve the mixdown to make them not collide, which would be very good to do! Thanks, PS. My e-mail address is [email protected] if you'd like to send files. |
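The triage described above can be sketched in a few lines. This is only an illustration of the reasoning, not Meow code: `diagnose_collision` is a hypothetical helper, and it assumes the four state lanes (S0123, S4567, S89AB, SCDEF) have been dumped as byte strings at the end of the function for each file.

```python
from typing import Sequence


def diagnose_collision(
    state_a: Sequence[bytes],
    state_b: Sequence[bytes],
    hash_a: bytes,
    hash_b: bytes,
) -> str:
    """Say which stage is implicated when two different inputs are compared.

    state_a / state_b: the dumped lanes (hypothetically S0123, S4567, S89AB,
    SCDEF) captured just before the final mixdown.
    hash_a / hash_b: the final hash values for the same two inputs.
    """
    if hash_a != hash_b:
        return "no collision"
    if list(state_a) == list(state_b):
        # The main loop already produced identical state, so changing the
        # mixdown cannot separate these inputs.
        return "main loop"
    # The states differ but the final values match: the mixdown lost the
    # difference, so the mixdown is the part that could be improved.
    return "mixdown"
```

For a reported pair, a result of "mixdown" would point at a fixable collision in the sense described above, while "main loop" would point at one that is probably unfixable.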
I threw all of Battlefield 5's content at it. Edited to correct numbers, so people reading the thread don't read wrong info at the top. |
This is wonderful! Thank you very much! I will look at these today. - Casey |
Curious @tvandijck - any triple-pairs? As in, are there any three-or-more groups of files that all collide? Or are they all double pairs? |
@qix I have 120 pairs, 17 triplets, 4 quadruplets, and 2 quintuplets... |
Have to correct my stats.... my testing code was bugged :( crazy little typo and embarrassing stuff.. Anyway... with that corrected... 0 collisions for BF5. I'll see if I can throw Battlefront 2 at it later today. |
Battlefront 2 + Battlefield 5 data together in one database.... |
That sounds more like it :) Thanks for testing! So we are now down to only one person (@aveao) who has reported 13 collisions, but we have not heard anything more from them. Can we verify that these are real collisions somehow, and if they are, get some repro cases? I still have found zero collisions for Meow and so I really need more testing... - Casey |
In @aveao collisions the first "letter" of the SHA256 of every conflict happens to be identical on the two files colliding. That seems very peculiar, definitely seems worth double checking that test code. |
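To put a rough number on how peculiar that is, here is a small sketch. It assumes the first hex digit of a SHA-256 digest is uniform over 16 values and independent between two unrelated files; `chance_all_pairs_match` is a name made up for this illustration.

```python
from fractions import Fraction


def chance_all_pairs_match(pairs: int, symbols: int = 16) -> Fraction:
    """Probability that every one of `pairs` unrelated file pairs shares the
    same first hex digit, assuming that digit is uniform and independent."""
    return Fraction(1, symbols) ** pairs


if __name__ == "__main__":
    p = chance_all_pairs_match(13)  # 13 reported collisions
    print(f"P(all 13 pairs share a first hex digit) = {p} ~= {float(p):.3e}")
```

Under those assumptions the chance is 1 in 16^13 (that is, 2^52), so a coincidence is a far less likely explanation than a bug in the test code or genuinely identical files.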
Is there some way to send a message to a GitHub user? If we can't contact @aveao, then I'm going to call it a misreport for now, since it does seem awfully suspicious and we can't get the files to verify. - Casey |
I've got access to 140TB of audio data, with files ranging in size from 10MB to 2GB, which are all unique in terms of SHA256. Would you be interested in me running meow_hash across it? |
@atruskie That sounds fantastic. |
@cmuratori I'm working on an official collision tester that I'll PR in by the way. I'm also getting it to work on clang/MacOS. |
@cmuratori Heyo! Sorry for late reply. I didn't get a chance to check stuff yet, work and all. I'm not too experienced in C++ so it's possible that I made a mistake. I'll test with @Qix-'s collision tester once it's PR'd in. |
@Qix-, I put mine in a gist here: https://gist.github.com/tvandijck/e8ac50f01b6c656f5599d50b83e35ca9 it's windows only though... |
I'm debugging a few issues so it might be a little while. I'm using an mmap solution that I'll have to port to windows at some point (or just fall back to using regular streaming, but I wanted to avoid using buffers for the sake of I/O throughput). It's almost done, I'm just working through a bit of a puddle of platform issues, most of which I'll submit as separate PRs. |
Collision checker has been PR'd into #15. There are a number of dependency PR's and windows support needs to be tested (sorry). |
Yes, definitely, that would be awesome! I am working on the 0.2 release of Meow right now, and it will come with Linux/Windows buildable utility that checks directory trees for collisions. Please stay tuned :) - Casey |
Looking to use this as a block hash for a game deployment pipeline: ~400 builds of ~200-500MB each chunked into 1MB blocks. Total is ~40GB. Will test this sometime soon. |
I've run meow hash on 2,709,214 files in a Linux distribution build folder, which has all the source code, build temporary files, output packages, compiler, dependencies, sysroot - a lot of stuff. Total size 131GB. No collisions, both for the 512-bit hash output and when truncated to the first 128 bits. |
Ran it on approximately 41,000 separate 1MB chunks of mostly LZMA compressed game bundles. 0 collisions. |
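A block-level test like the ones described above might be structured as follows. This is a minimal sketch, not the code anyone in this thread ran: Python's standard library has no Meow binding, so `stand_in_hash` uses BLAKE2b truncated to 128 bits purely as a placeholder, and all function names are hypothetical.

```python
import hashlib
import io
from typing import BinaryIO, Iterator

BLOCK_SIZE = 1024 * 1024  # 1MB blocks, as in the comments above


def iter_blocks(stream: BinaryIO, block_size: int = BLOCK_SIZE) -> Iterator[bytes]:
    """Yield consecutive blocks of at most block_size bytes from stream."""
    while True:
        block = stream.read(block_size)
        if not block:
            return
        yield block


def stand_in_hash(block: bytes) -> bytes:
    """Placeholder for the hash under test (NOT Meow): 128-bit BLAKE2b."""
    return hashlib.blake2b(block, digest_size=16).digest()


def count_block_collisions(stream: BinaryIO, block_size: int = BLOCK_SIZE) -> int:
    """Count blocks whose hash matches an earlier block with different content.

    Identical blocks are duplicates, not collisions, so each digest is paired
    with a SHA-256 fingerprint of the first block that produced it.
    """
    seen = {}  # digest under test -> SHA-256 of the first block with that digest
    collisions = 0
    for block in iter_blocks(stream, block_size):
        digest = stand_in_hash(block)
        fingerprint = hashlib.sha256(block).digest()
        if seen.setdefault(digest, fingerprint) != fingerprint:
            collisions += 1
    return collisions


if __name__ == "__main__":
    data = io.BytesIO(b"a" * 10 + b"b" * 10 + b"a" * 10)
    print(count_block_collisions(data, block_size=10))
```

Keeping a second, independent fingerprint per digest is one way to avoid counting repeated blocks as collisions, which matters for build data where many blocks are identical.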
Meow v0.2 is now available and includes a collision search utility called "meow_search". It should build on both Windows/MSVC and Linux/CLANG, so you can search Windows machines or Linux machines for files that produce hash collisions. Hopefully it is robust. Please report any bugs! It will report collisions for 128-bit, and also 64-bit and 32-bit truncations. I have not found any 128-bit or 64-bit collisions. 32-bit collisions are expected on anything in the tens-of-thousands range, so I have found some but they were expected - however it is still useful to note them, just in case they show up in suspiciously large numbers! The hashing function has been changed to be more efficient in this version, and may have been weakened, so if everyone who has a chance could re-run their datasets with v0.2, that would be very much appreciated! Barring major revelations, this is basically the construction Meow will use, so I'd like to get it thoroughly vetted against as many datasets as possible. Thanks, |
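The remark about 32-bit collisions being expected follows from the birthday bound. As a rough sketch, assuming the hash behaves like a uniformly random function, the expected number of colliding pairs among n distinct inputs is n(n-1)/2 divided by 2^bits; `expected_collision_pairs` is a helper name invented for this example.

```python
def expected_collision_pairs(n_items: int, hash_bits: int) -> float:
    """Expected number of colliding pairs among n_items distinct inputs,
    assuming an ideal, uniformly random hash with hash_bits output bits."""
    pair_count = n_items * (n_items - 1) // 2  # exact integer number of pairs
    return pair_count / 2 ** hash_bits


if __name__ == "__main__":
    # Dataset sizes taken from comments in this thread.
    for n in (41_000, 2_709_214):
        for bits in (32, 64, 128):
            pairs = expected_collision_pairs(n, bits)
            print(f"n={n:>9,} bits={bits:>3}: {pairs:.3e} expected pairs")
```

Comparing an observed 32-bit collision count against this estimate is one way to judge whether the count is "suspiciously large"; at 64 and 128 bits the estimate is so small for these dataset sizes that any collision at all deserves a repro case.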
Just ran Meow v0.2 on parts of our Mercurial repository store. The "regular versioned source files" part:
The "large versioned binary files" part:
|
Well... I just ran Yes, 13 dupes, not collisions. Sigh. But why did they get different SHA256 hashes? Why did they look different when I observed them? Well that's up to us to determine now I suppose. It's good to know that I'm not going crazy, and that I can actually write acceptable C++. But yes, there's 2 meow 32-bit 128-wide collisions:
Wouldn't it be better to have a Makefile instead of a |
@aveao When you run your utility, and it reports 13 collisions, what happens if you run Beyond Compare or some other diff utility on one of the pairs it reports? Or, can you perhaps send me one pair that collides for me to look at, if indeed you still think they are not identical files? Thanks, |
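The check suggested above can also be automated, so that a tester only reports pairs that really differ byte for byte. The following is a minimal sketch with hypothetical names; `stand_in_hash` again uses 128-bit BLAKE2b as a placeholder because Meow is not available in the Python standard library.

```python
import filecmp
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

PathPair = Tuple[Path, Path]


def stand_in_hash(path: Path) -> bytes:
    """Placeholder for the hash under test (NOT Meow): 128-bit BLAKE2b."""
    return hashlib.blake2b(path.read_bytes(), digest_size=16).digest()


def classify_pairs(
    paths: Iterable[Path],
    hash_file: Callable[[Path], bytes] = stand_in_hash,
) -> Tuple[List[PathPair], List[PathPair]]:
    """Split same-hash file pairs into (duplicates, collisions).

    A pair is a duplicate if the files are byte-for-byte identical, and a
    real collision only if the contents differ.
    """
    groups = defaultdict(list)
    for path in paths:
        groups[hash_file(path)].append(path)

    duplicates: List[PathPair] = []
    collisions: List[PathPair] = []
    for group in groups.values():
        for i, first in enumerate(group):
            for second in group[i + 1:]:
                # shallow=False forces a content comparison instead of
                # trusting size and modification time alone.
                if filecmp.cmp(first, second, shallow=False):
                    duplicates.append((first, second))
                else:
                    collisions.append((first, second))
    return duplicates, collisions
```

Because the comparison does not depend on a second hash implementation, it would also catch the case where the reference hashing code itself is what is wrong.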
Thanks very much! Glad to see there were no collision issues. - Casey |
Ran it again on my Linux build folder, no meow128 collisions:
A large number of duplicates is the expected result. |
@cmuratori they probably are identical, I doubt that we found 13 or even 1 sha256 collision. |
@aveao But I thought you said they had different SHA256 hashes? Is your SHA hashing code messed up, maybe? - Casey |
If you're looking for collisions, I found a 128-bit collision in https://github.com/dvyukov/go-fuzz-corpus
One consists of 65 |
Yes! That is very helpful. I will download the corpus and try to repro it. Which Meow hash was this? (v0.1 or v0.2) Thanks, |
This was at git sha1 67ac7f3, so currently HEAD. |
There is a new candidate for v0.3 (see the v0.3 branch). It does not collide on go-fuzz, but it has also not been tested on any of the large datasets that folks have tested prior, so we may have some regressions. Any testing of the new function would be greatly appreciated, and if you can find any collisions please send them! Thanks, |
Just ran Meow v0.3 on parts of our Mercurial repository store, similar to previous test on v0.2. TLDR: 32 bit hash has fewer collisions, yay! The "regular versioned source files" part:
The "large versioned binary files" part:
|
Excellent! - Casey |
We are now getting down to the nitty-gritty. v0.4 has been posted, and should hopefully provide the same collision resistance as v0.3, while now being substantially faster on small inputs (we are now the fastest smhasher-passing hash we know of for all input sizes, period). We will be doing a little more testing, but unless something new comes up, I will try to have a v0.5 branch up sometime before the end of the year that we can definitively test as the "final" Meow hash to be christened v1.0. Thanks everyone for your testing help! - Casey |
Ran 0.4 on the same dataset as before. 128 & 64 bit hashes still zero collisions, 32 bit now has 9 collisions where 0.3 had 7. Not sure if that's worth worrying about at all; I think the amount of collisions is ballpark where I would expect with only 32 bits of hash. If you really want to look at it, here's the files from that data set where 0.3 had collisions, and where 0.4 had collisions. |
Hi @cmuratori, Ran HEAD (0.4) on the ImageNet dataset 2017 from computer vision setting. 32-bit collisions ONLY! With this, I can conclude that meowhash is reliable and ImageNet2017 is a high quality dataset! Best fortunes, |
We have substantially updated the Meow Hash for v0.5. We believe it now has significantly improved collision resistance. If you would like to conduct testing, the v0.5 branch is now available here as a pull request for early access testing :) - Casey |
A bunch of test executables (test, bench, search) do not compile with the provided scripts (on Mac/Linux at least) due to wrong paths, but also the test programs seem to want to include files that are removed ( |
Ah crap - somehow the cpps were all the old ones. I admit I am not particularly good with the GitHub web client (or anything Git-related for that matter). They should be updated now. Keep in mind, though, we haven't posted a Mac/Linux build yet. So the build.sh is actually old, and may not work. This is Windows-only at the moment, although it probably isn't far off from working on Mac/Linux. - Casey |
@aras-p I pushed a new build.sh and meow_test.cpp today. Those are the only two things that had issues on Linux. Everything should be compiling now on Windows and Linux. I don't test on Mac, so YMMV there, but IIRC the Linux and Mac builds were very similar so it shouldn't be hard to get it running. - Casey |
Hash tests are hard to come by, so testing Meow has been a mix of us testing our dataset for collisions and verifying that smhasher doesn't find anything suspicious. But that's not really sufficient. Anyone particularly interested in trying out Meow who has a large dataset to test, please post here so I can coordinate with you as I try to finalize the finer points of the implementation. It would be very nice to be able to verify that we don't break anything with changes!
Thanks,
- Casey
PS. And I guess even if you don't have a large dataset, if you just have some useful hashing case that might find collisions we don't know about, that helps too!